Introduction

Suicide is the deliberate act of ending one’s own life. Risk factors include mental disorders such as depression, bipolar disorder, autism, schizophrenia, and personality disorders, as well as external stressors such as financial struggles, academic pressure, relationship problems, harassment, and bullying. Substance abuse, including alcoholism and benzodiazepine use, also raises the risk, and a previous suicide attempt significantly elevates the risk of future attempts.

Prevention involves a multifaceted approach: restricting access to common methods such as firearms, drugs, and poisons; treating mental health conditions and substance abuse; responsible media reporting on suicide; and improving economic conditions. Crisis hotlines are widely available, but their effectiveness has not been well researched.

The prevalence and methods of suicide vary across countries, in part because of differences in access to lethal means. Hanging, pesticide ingestion, poisoning, and firearms are among the most common methods. Globally, suicide claims over 700,000 lives annually, making it the 10th leading cause of death worldwide. Suicide accounts for approximately 1.5% of all deaths, and the rate is roughly 12 per 100,000 people per year. Men are more likely to die by suicide than women, with male rates ranging from 1.5 times higher in developing countries to 3.5 times higher in developed ones. Financial strain often exacerbates the risk of suicide. [1]

Research Question

Suicide is not confined to high-income countries; it is a global issue affecting all regions. Surprisingly, over 77% of suicides occur in low- and middle-income countries. However, high-income countries exhibit the highest age-standardized suicide rates. In ongoing research across 19 countries, we explore correlations between GDP, bankruptcy rates, happiness levels, and suicide rates. Understanding these intricate relationships is pivotal for policymakers, researchers, and stakeholders to devise effective interventions promoting mental well-being and economic resilience. Leveraging multidimensional datasets and advanced analytical methods, our project aims to illuminate these connections for the betterment of societies worldwide.

Data Wrangling

# Load libraries
library(dplyr)
library(readr)
library(tidyr)
library(lubridate)
library(ggplot2)
library(stringr)
library(readxl)
library(httr)

Exploration of the Bankruptcies Dataset


# Load dataset
bankruptcies <- read_csv("https://raw.githubusercontent.com/Alexburk93/Data_Wrangling_EDA/main/data/raw_data/Bankruptcies_2011-2020.csv")
New names:
Rows: 987 Columns: 19
── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (13): COU, Country, VAR, Variable, MEA, Measure, ISIC4...7, ISIC4...8, TIME, Time, Unit Code, Unit, PowerCode
dbl  (2): PowerCode Code, Value
lgl  (4): Reference Period Code, Reference Period, Flag Codes, Flags
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Display structure of dataset
str(bankruptcies)
spc_tbl_ [987 × 19] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ COU                  : chr [1:987] "CAN" "CAN" "CAN" "CAN" ...
 $ Country              : chr [1:987] "Canada" "Canada" "Canada" "Canada" ...
 $ VAR                  : chr [1:987] "BANKRUPTCIES" "BANKRUPTCIES" "BANKRUPTCIES" "BANKRUPTCIES" ...
 $ Variable             : chr [1:987] "Number of bankruptcies" "Number of bankruptcies" "Number of bankruptcies" "Number of bankruptcies" ...
 $ MEA                  : chr [1:987] "INDEX" "INDEX" "INDEX" "INDEX" ...
 $ Measure              : chr [1:987] "Index 2007=100" "Index 2007=100" "Index 2007=100" "Index 2007=100" ...
 $ ISIC4...7            : chr [1:987] "01_99" "01_99" "01_99" "01_99" ...
 $ ISIC4...8            : chr [1:987] "Grand Total" "Grand Total" "Grand Total" "Grand Total" ...
 $ TIME                 : chr [1:987] "2011-Q1" "2011-Q2" "2011-Q3" "2011-Q4" ...
 $ Time                 : chr [1:987] "Q1-2011" "Q2-2011" "Q3-2011" "Q4-2011" ...
 $ Unit Code            : chr [1:987] "IDX" "IDX" "IDX" "IDX" ...
 $ Unit                 : chr [1:987] "Index" "Index" "Index" "Index" ...
 $ PowerCode Code       : num [1:987] 0 0 0 0 0 0 0 0 0 0 ...
 $ PowerCode            : chr [1:987] "Units" "Units" "Units" "Units" ...
 $ Reference Period Code: logi [1:987] NA NA NA NA NA NA ...
 $ Reference Period     : logi [1:987] NA NA NA NA NA NA ...
 $ Value                : num [1:987] 60 58.3 56.3 55.3 54 ...
 $ Flag Codes           : logi [1:987] NA NA NA NA NA NA ...
 $ Flags                : logi [1:987] NA NA NA NA NA NA ...
 - attr(*, "spec")=
  .. cols(
  ..   COU = col_character(),
  ..   Country = col_character(),
  ..   VAR = col_character(),
  ..   Variable = col_character(),
  ..   MEA = col_character(),
  ..   Measure = col_character(),
  ..   ISIC4...7 = col_character(),
  ..   ISIC4...8 = col_character(),
  ..   TIME = col_character(),
  ..   Time = col_character(),
  ..   `Unit Code` = col_character(),
  ..   Unit = col_character(),
  ..   `PowerCode Code` = col_double(),
  ..   PowerCode = col_character(),
  ..   `Reference Period Code` = col_logical(),
  ..   `Reference Period` = col_logical(),
  ..   Value = col_double(),
  ..   `Flag Codes` = col_logical(),
  ..   Flags = col_logical()
  .. )
 - attr(*, "problems")=<externalptr> 
# Show the first few rows 
head(bankruptcies)

# Summary statistics of numerical attributes
summary(bankruptcies)
     COU              Country              VAR              Variable             MEA              Measure         
 Length:987         Length:987         Length:987         Length:987         Length:987         Length:987        
 Class :character   Class :character   Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                                                                  
                                                                                                                  
                                                                                                                  
  ISIC4...7          ISIC4...8             TIME               Time            Unit Code             Unit          
 Length:987         Length:987         Length:987         Length:987         Length:987         Length:987        
 Class :character   Class :character   Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                                                                  
                                                                                                                  
                                                                                                                  
 PowerCode Code  PowerCode         Reference Period Code Reference Period     Value        Flag Codes    
 Min.   :0      Length:987         Mode:logical          Mode:logical     Min.   : 32.09   Mode:logical  
 1st Qu.:0      Class :character   NA's:987              NA's:987         1st Qu.: 97.05   NA's:987      
 Median :0      Mode  :character                                          Median :121.18                 
 Mean   :0                                                                Mean   :139.36                 
 3rd Qu.:0                                                                3rd Qu.:146.24                 
 Max.   :0                                                                Max.   :949.42                 
  Flags        
 Mode:logical  
 NA's:987      
               
               
               
               
# Summary statistics of categorical attribute
table(bankruptcies$Country)

     Australia        Belgium         Brazil         Canada        Denmark        Finland         France 
            41             82             36             40             40             80             40 
       Germany        Iceland          Italy          Japan    Netherlands    New Zealand         Norway 
            80             40             40             40             80             28             80 
  South Africa          Spain         Sweden United Kingdom  United States 
            38             40             82             40             40 
# Check for missing values
sum(is.na(bankruptcies))
[1] 3948
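
The total of 3948 equals 4 × 987, which suggests the missing values sit entirely in four columns. A per-column count makes that explicit. The following is a minimal sketch on a toy data frame (the `toy` table and its values are illustrative assumptions, not the real dataset):

```r
# Toy data frame standing in for `bankruptcies` (values are illustrative only)
toy <- data.frame(
  Country = c("Canada", "Canada", "Japan"),
  Value   = c(60.0, 58.3, 101.2),
  Flags   = c(NA, NA, NA)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
na_per_column

# Names of columns that contain nothing but NA
all_na_columns <- names(toy)[na_per_column == nrow(toy)]
all_na_columns
```

Applied to the real table, the same two calls would list the columns that are safe to drop.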
# Remove the four all-NA columns and TIME, which duplicates Time
data <- select(bankruptcies, -`Flag Codes`, -Flags, -`Reference Period Code`, -`Reference Period`, -TIME)
data %>% 
  sample_n(10)

# Display the structure of the modified dataset
str(data)
tibble [987 × 14] (S3: tbl_df/tbl/data.frame)
 $ COU           : chr [1:987] "CAN" "CAN" "CAN" "CAN" ...
 $ Country       : chr [1:987] "Canada" "Canada" "Canada" "Canada" ...
 $ VAR           : chr [1:987] "BANKRUPTCIES" "BANKRUPTCIES" "BANKRUPTCIES" "BANKRUPTCIES" ...
 $ Variable      : chr [1:987] "Number of bankruptcies" "Number of bankruptcies" "Number of bankruptcies" "Number of bankruptcies" ...
 $ MEA           : chr [1:987] "INDEX" "INDEX" "INDEX" "INDEX" ...
 $ Measure       : chr [1:987] "Index 2007=100" "Index 2007=100" "Index 2007=100" "Index 2007=100" ...
 $ ISIC4...7     : chr [1:987] "01_99" "01_99" "01_99" "01_99" ...
 $ ISIC4...8     : chr [1:987] "Grand Total" "Grand Total" "Grand Total" "Grand Total" ...
 $ Time          : chr [1:987] "Q1-2011" "Q2-2011" "Q3-2011" "Q4-2011" ...
 $ Unit Code     : chr [1:987] "IDX" "IDX" "IDX" "IDX" ...
 $ Unit          : chr [1:987] "Index" "Index" "Index" "Index" ...
 $ PowerCode Code: num [1:987] 0 0 0 0 0 0 0 0 0 0 ...
 $ PowerCode     : chr [1:987] "Units" "Units" "Units" "Units" ...
 $ Value         : num [1:987] 60 58.3 56.3 55.3 54 ...
# Check the dimension of data
dim(data)
[1] 987  14

This project focuses on the years 2011 to 2020 and, to keep the workload manageable, on 19 countries.

# Create new columns 'Quarter' and 'Year' from 'Time'
new_data <- mutate(data,
                   Quarter = str_sub(Time, 1, 2), # Extract Quarter
                   Year = as.numeric(str_sub(Time, 4))) # Extract Year and convert to numeric

# remove the original 'Time' column
new_data <- select(new_data, -Time)  
new_data
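
Extracting by character position works as long as every label has exactly the shape "Q1-2011". A possible alternative, sketched here in base R on a few example labels, is to parse with a regular expression and fail early if a label does not match. The names `time_labels`, `pattern`, and `parsed` are hypothetical:

```r
# Example labels in the same "Qn-YYYY" shape as the Time column
time_labels <- c("Q1-2011", "Q4-2020", "Q2-2021")

# Capture the quarter digit and the four-digit year with one regular expression
pattern <- "^Q([1-4])-([0-9]{4})$"
stopifnot(all(grepl(pattern, time_labels)))  # fail early on unexpected formats

quarter <- as.integer(sub(pattern, "\\1", time_labels))
year    <- as.integer(sub(pattern, "\\2", time_labels))

parsed <- data.frame(Time = time_labels, Quarter = quarter, Year = year)
parsed
```

Here the quarter is stored as an integer rather than as "Q1", which is a design choice of this sketch and not of the notebook.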

# Get date range
date_range <- range(new_data$Year)

# Count unique countries
n_countries <- length(unique(new_data$Country))

# Print the results
n_countries
[1] 19
date_range
[1] 2011 2021
# Only need data from 2011-2020, excluding 2021
bankruptcy_data <- filter(new_data, Year != 2021)
bankruptcy_data

# Get date range of new updated dataset
date_range <- range(bankruptcy_data$Year)

# Extract start and end years from the range
start_year <- date_range[1]
end_year <- date_range[2]

# Count unique countries
n_countries <- length(unique(bankruptcy_data$Country))

# Print the results
cat("Number of countries: ", n_countries, "\n") 
Number of countries:  19 
cat("Date range from: ", start_year, " to ", end_year, "\n")
Date range from:  2011  to  2020 

Exploration of the GDP Dataset

# Load dataset 
my_data <- read_csv("https://raw.githubusercontent.com/Alexburk93/Data_Wrangling_EDA/main/data/raw_data/GDP_Data/GDP_capita_1960_2022.csv")
Rows: 266 Columns: 10
── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Country Name
dbl (9): 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
my_data

The years are stored as column headers, so pivot_longer() is used to reshape them into two columns, Year and GDP.

# Pivot the data from wide to long format
joy <- pivot_longer(my_data, 
                         cols = -c(`Country Name`),
                         names_to = "Year", 
                         values_to = "GDP")
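
To illustrate the reshape on a small scale, the sketch below pivots a made-up two-country table. It also uses the `names_transform` argument of `pivot_longer()` to store Year as an integer straight away, which is an optional variation and not what the notebook does. The `wide` and `long` tables are hypothetical:

```r
library(tidyr)

# Toy wide table: one row per country, one column per year (values are made up)
wide <- data.frame(
  `Country Name` = c("A", "B"),
  `2011` = c(1.0, 3.0),
  `2012` = c(2.0, 4.0),
  check.names = FALSE
)

long <- pivot_longer(
  wide,
  cols = -`Country Name`,
  names_to = "Year",
  values_to = "GDP",
  names_transform = list(Year = as.integer)  # store Year as integer, not character
)
long
```

Each country row becomes one row per year column, so the result has rows(wide) × number of year columns rows.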

Perform summary statistics and check for missing values, dimensions (dim()), etc.

# Display structure of dataset
str(joy)
tibble [2,394 × 3] (S3: tbl_df/tbl/data.frame)
 $ Country Name: chr [1:2394] "Aruba" "Aruba" "Aruba" "Aruba" ...
 $ Year        : chr [1:2394] "2011" "2012" "2013" "2014" ...
 $ GDP         : num [1:2394] 26043 25611 26515 26940 28419 ...
# Show the first few rows 
head(joy)

# Summary statistics of numerical attributes
summary(joy)
 Country Name           Year                GDP        
 Length:2394        Length:2394        Min.   :   217  
 Class :character   Class :character   1st Qu.:  2099  
 Mode  :character   Mode  :character   Median :  6570  
                                       Mean   : 16591  
                                       3rd Qu.: 19866  
                                       Max.   :199383  
                                       NA's   :66      
# Summary statistics of categorical attribute
table(joy$Year)

2011 2012 2013 2014 2015 2016 2017 2018 2019 
 266  266  266  266  266  266  266  266  266 
# Check for missing values
sum(is.na(joy))
[1] 66
# Check the dimension of data
dim(joy)
[1] 2394    3

Filter the data to only show the 19 countries and the date range the project is focused on.

# Define the list of 19 countries
countries_of_interest <- c("Australia", "Belgium", "Brazil", "Canada", "Denmark",
                           "Finland", "France", "Germany", "Iceland", "Italy",
                           "Japan", "Netherlands", "New Zealand", "Norway", 
                           "South Africa", "Spain", "Sweden", "United Kingdom", 
                           "United States")

# Filter the data for the 19 countries
gdp <- joy %>%
  filter(`Country Name` %in% countries_of_interest)

# Define the date range of focus
start_year <- 2011
end_year <- 2020

# Filter the data for the date range
gdp <- gdp %>%
  filter(as.numeric(Year) >= start_year & as.numeric(Year) <= end_year)

# Show the filtered data
gdp %>% 
  sample_n(10)
# Display structure of dataset
str(gdp)
tibble [171 × 3] (S3: tbl_df/tbl/data.frame)
 $ Country Name: chr [1:171] "Australia" "Australia" "Australia" "Australia" ...
 $ Year        : chr [1:171] "2011" "2012" "2013" "2014" ...
 $ GDP         : num [1:171] 62610 68078 68198 62558 56759 ...
# Convert Year to numeric 
# gdp$Year <- as.numeric(gdp$Year)

# Show the first few rows 
head(gdp)

# Summary statistics of numerical attributes
summary(gdp)
 Country Name           Year                GDP        
 Length:171         Length:171         Min.   :  5735  
 Class :character   Class :character   1st Qu.: 38808  
 Mode  :character   Mode  :character   Median : 46299  
                                       Mean   : 45324  
                                       3rd Qu.: 53541  
                                       Max.   :103554  
# Summary statistics of categorical attribute
table(gdp$Year)

2011 2012 2013 2014 2015 2016 2017 2018 2019 
  19   19   19   19   19   19   19   19   19 
# Check for missing values
sum(is.na(gdp))
[1] 0
# Check the dimension of data
dim(gdp)
[1] 171   3
# Group the filtered data by Year and calculate the mean GDP for each year
mean_gdp_by_year <- gdp %>%
  group_by(Year) %>%
  summarize(mean_GDP = mean(GDP, na.rm = TRUE))

# Show the mean GDP for each year
print(mean_gdp_by_year)

Visualization of GDP over time for the 10 countries with the highest GDP.

# Filter to include only the top 10 countries with the highest GDP values
top_10_gdp <- gdp %>%
  group_by(`Country Name`) %>%
  summarize(total_gdp = sum(GDP, na.rm = TRUE)) %>%
  top_n(10, total_gdp) %>%
  left_join(gdp, by = "Country Name")

# Convert GDP values to millions or billions
top_10_gdp <- top_10_gdp %>%
  mutate(GDP_formatted = case_when(
    GDP >= 1e12 ~ paste0(round(GDP / 1e12, 1), "T"),  # Convert to trillions
    GDP >= 1e9 ~ paste0(round(GDP / 1e9, 1), "B"),  # Convert to billions
    GDP >= 1e6 ~ paste0(round(GDP / 1e6, 1), "M"),   # Convert to millions
    TRUE ~ as.character(GDP)                        # Keep unchanged if less than 1 million
  ))

# Create a line plot of GDP over time with formatted values for the top 10 countries
ggplot(top_10_gdp, aes(x = Year, y = GDP, group = `Country Name`)) +
  geom_line(aes(color = `Country Name`)) +
  labs(title = "GDP Trends Over Time (Top 10 Countries)",
       x = "Year",
       y = "GDP",
       color = "Country") +
  scale_y_continuous(labels = function(x) paste0(x, "")) +  # Ensure y-axis labels are character type
  theme_minimal()


Exploration of the Suicide Rate Dataset

# Load data set from github
suicide_df = read_csv("https://raw.githubusercontent.com/Alexburk93/Data_Wrangling_EDA/main/data/raw_data/death-rate-from-suicides-gho%20new.csv")
Rows: 3876 Columns: 4
── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Entity, Code, Age-standardized suicide rate - Sex: both sexes
dbl (1): Year
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
suicide_df$`Age-standardized suicide rate - Sex: both sexes` = as.double(suicide_df$`Age-standardized suicide rate - Sex: both sexes`)
Warning: NAs introduced by coercion
summary(suicide_df)
    Entity              Code                Year      Age-standardized suicide rate - Sex: both sexes
 Length:3876        Length:3876        Min.   :2000   Min.   :  0.000                                
 Class :character   Class :character   1st Qu.:2004   1st Qu.:  5.955                                
 Mode  :character   Mode  :character   Median :2009   Median : 10.015                                
                                       Mean   :2009   Mean   : 68.421                                
                                       3rd Qu.:2014   3rd Qu.:141.230                                
                                       Max.   :2019   Max.   :962.889                                
                                                      NA's   :5                                      
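
The coercion warning means some entries in the rate column were not numeric text. Before dropping them, it can help to see which raw values failed to parse. A minimal sketch on a made-up vector (`raw_rate` and its contents are assumptions for illustration):

```r
# Toy character vector mixing numeric strings with a non-numeric entry
raw_rate <- c("10.2", "5.9", "not available", "7.0")

# Convert, suppressing the coercion warning because we inspect the result next
rate <- suppressWarnings(as.double(raw_rate))

# Entries that were present as text but could not be parsed as numbers
failed <- raw_rate[is.na(rate) & !is.na(raw_rate)]
failed
```

On the real column, this check would have to run on the original character values, before they are overwritten by the converted ones.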
# Consider removing Lesotho from the analysis 
suicide_df %>% 
  filter(is.na(`Age-standardized suicide rate - Sex: both sexes`))
# Remove the Region / Income classes and Lesotho
suicide_df_countries = suicide_df %>% 
  filter(Code != "0", Code != "LSO")
avg_country = suicide_df_countries %>% 
  group_by(Entity) %>% 
  summarise(avg_per_country = mean(`Age-standardized suicide rate - Sex: both sexes`)) %>% 
  arrange(desc(avg_per_country))
ggplot(avg_country, aes(x = avg_per_country)) +
  geom_histogram() +
  labs(x = "Mean Suicide rate", y = "Frequency", title = "Distribution of mean of suicide rate") +
  theme_minimal()

Exploration of the Happiness Dataset

# Use the raw file URL
url <- "https://github.com/Alexburk93/Data_Wrangling_EDA/raw/main/data/raw_data/WHR20_DataForTable2.1.xls"
response <- GET(url)
content <- content(response, "raw")
temp <- tempfile(fileext = ".xls")
writeBin(content, temp)
happiness_df <- read_excel(temp)

# View the data
head(happiness_df)
# create data frame with selected columns. Based on the description of the different variables.
happiness_df_filtered = happiness_df %>% 
  select(`Country name`, `year` , `Life Ladder`, `Social support`, `Healthy life expectancy at birth`, `Freedom to make life choices`, `Perceptions of corruption`)

summary(happiness_df_filtered)
 Country name            year       Life Ladder    Social support   Healthy life expectancy at birth
 Length:1848        Min.   :2005   Min.   :2.375   Min.   :0.2902   Min.   :32.30                   
 Class :character   1st Qu.:2010   1st Qu.:4.623   1st Qu.:0.7483   1st Qu.:58.30                   
 Mode  :character   Median :2013   Median :5.363   Median :0.8340   Median :65.10                   
                    Mean   :2013   Mean   :5.446   Mean   :0.8111   Mean   :63.17                   
                    3rd Qu.:2016   3rd Qu.:6.268   3rd Qu.:0.9046   3rd Qu.:68.39                   
                    Max.   :2019   Max.   :8.019   Max.   :0.9873   Max.   :77.10                   
                                                   NA's   :13       NA's   :52                      
 Freedom to make life choices Perceptions of corruption
 Min.   :0.2575               Min.   :0.0352           
 1st Qu.:0.6431               1st Qu.:0.6927           
 Median :0.7575               Median :0.8036           
 Mean   :0.7385               Mean   :0.7491           
 3rd Qu.:0.8524               3rd Qu.:0.8737           
 Max.   :0.9852               Max.   :0.9833           
 NA's   :31                   NA's   :103              
avg_happiness_per_country = happiness_df_filtered %>% 
  group_by(`Country name`) %>% 
  summarise(avg_happiness = mean(`Life Ladder`)) %>% 
  arrange(desc(avg_happiness))
ggplot(avg_happiness_per_country, aes(x = avg_happiness)) +
  geom_histogram() +
  labs(x = "Mean Happiness Level", y = "Frequency", title = "Distribution of mean of happiness rate") +
  theme_minimal()

Data for Final Presentation

Happiness Data for the 19 countries of interest from 2011 - 2019

# Define the list of 19 countries
countries_of_interest <- c("Australia", "Belgium", "Brazil", "Canada", "Denmark",
                           "Finland", "France", "Germany", "Iceland", "Italy",
                           "Japan", "Netherlands", "New Zealand", "Norway", 
                           "South Africa", "Spain", "Sweden", "United Kingdom", 
                           "United States")

# Filter the data for the 19 countries
happiness_new<- happiness_df_filtered %>%
  filter(`Country name` %in% countries_of_interest)

# Define the date range of focus
start_year <- 2011
end_year <- 2020

# Filter the data for the date range
happiness_data <- happiness_new %>%
  filter(as.numeric(year) >= start_year & as.numeric(year) <= end_year)

# Show the filtered data
happiness_data %>% 
  sample_n(10)
# Check if all countries of interest are present in the filtered data
missing_countries <- setdiff(countries_of_interest, unique(happiness_data$`Country name`))

# Print the missing countries, if any
if(length(missing_countries) > 0) {
  print("The following countries are missing from the filtered data:")
  print(missing_countries)
} else {
  print("All countries of interest are selected from the filtered data.")
}
[1] "All countries of interest are selected from the filtered data."
# Count the occurrences of each country in the filtered data
country_counts <- table(happiness_data$`Country name`)

# Print the country names and their counts
print("Country Name\t\tCount")
[1] "Country Name\t\tCount"
for (country in names(country_counts)) {
  cat(country, "\t\t\t", country_counts[country], "\n")
}
Australia        9
Belgium          9
Brazil           9
Canada           9
Denmark          9
Finland          9
France           9
Germany          9
Iceland          6
Italy            9
Japan            9
Netherlands      9
New Zealand      9
Norway           7
South Africa     9
Spain            9
Sweden           9
United Kingdom   9
United States    9
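
The counts show that Iceland and Norway have fewer than nine yearly observations. One way to find out which country-year pairs are absent is to build the full grid and anti-join it against the data. This is a sketch on a toy panel, with hypothetical column names `country`, `year`, and `score`:

```r
library(dplyr)
library(tidyr)

# Toy panel: country "B" has no row for 2012 (values are made up)
panel <- tibble(
  country = c("A", "A", "B"),
  year    = c(2011, 2012, 2011),
  score   = c(7.1, 7.0, 6.5)
)

# Build every country-year combination and keep those absent from the panel
missing_rows <- panel %>%
  expand(country, year) %>%
  anti_join(panel, by = c("country", "year"))
missing_rows
```

Knowing the missing years in advance helps explain the NA values that appear after the later left joins.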

Suicide Data for the 19 countries of interest from 2011 - 2019

# Change column name from `Entity` to `country name`
suicide_df <- rename(suicide_df, `Country name` = Entity)

# Define the list of 19 countries
countries_of_interest <- c("Australia", "Belgium", "Brazil", "Canada", "Denmark",
                           "Finland", "France", "Germany", "Iceland", "Italy",
                           "Japan", "Netherlands", "New Zealand", "Norway", 
                           "South Africa", "Spain", "Sweden", "United Kingdom", 
                           "United States")

# Filter the data for the 19 countries
new_suicide<- suicide_df %>%
  filter(`Country name` %in% countries_of_interest)

# Define the date range of focus
start_year <- 2011
end_year <- 2020

# Filter the data for the date range
suicide_data <- new_suicide %>%
  filter(as.numeric(Year) >= start_year & as.numeric(Year) <= end_year)

# Show the filtered data
suicide_data %>% 
  sample_n(10)

Align column names across all datasets and view sample data

gdp
bankruptcy_data
# Rename the country and year columns so all datasets share `Country name` and `Year`
gdp_data <- rename(gdp, `Country name` = `Country Name`)
bankruptcy_data<- rename(bankruptcy_data, `Country name` = `Country`)
happiness_data = rename(happiness_data, `Year` = `year`)
suicide_data %>% 
  sample_n(10)

happiness_data %>% 
  sample_n(10)

gdp_data %>% 
  sample_n(10)

bankruptcy_data %>% 
  sample_n(10)

Merging datasets

gdp_data
happiness_data
bankruptcy_data
suicide_data
# Convert "Year" column to character type in all datasets
suicide_data <- mutate(suicide_data, Year = as.character(Year))
happiness_data <- mutate(happiness_data, Year = as.character(Year))
gdp_data <- mutate(gdp_data, Year = as.character(Year))
bankruptcy_data <- mutate(bankruptcy_data, Year = as.character(Year))

# Merge the datasets
suicide_analysis <- suicide_data %>%
  left_join(happiness_data, by = c("Year", "Country name")) %>%
  left_join(gdp_data, by = c("Year", "Country name")) %>%
  left_join(bankruptcy_data, by = c("Year", "Country name"))
# View sample data from the merged dataset
suicide_analysis 
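
Because the bankruptcy table is quarterly while the other tables are annual, the last join repeats each annual row once per matching bankruptcy row. The country counts earlier also show roughly twice as many rows for some countries, which suggests more than one series per quarter. If one row per country and year is wanted, a possible approach is to average the quarters before joining. The sketch below uses made-up tables (`annual`, `quarterly`) with hypothetical column names:

```r
library(dplyr)

# Toy annual table: one row per country and year (values are made up)
annual <- tibble(country = "A", year = c(2011, 2012), rate = c(10, 11))

# Toy quarterly table: four rows per country and year
quarterly <- tibble(
  country = "A",
  year    = rep(c(2011, 2012), each = 4),
  index   = c(100, 110, 120, 130, 90, 95, 100, 105)
)

# Joining directly repeats each annual row once per quarter
joined_raw <- left_join(annual, quarterly, by = c("country", "year"))
nrow(joined_raw)

# Averaging the quarters first keeps one row per country and year
quarterly_avg <- quarterly %>%
  group_by(country, year) %>%
  summarise(mean_index = mean(index), .groups = "drop")

joined_annual <- left_join(annual, quarterly_avg, by = c("country", "year"))
joined_annual
```

Whether a yearly mean of the index is the right summary is an analytical choice; keeping the quarterly rows, as the notebook does, is also valid as long as later averages account for the repetition.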

Save the dataset in CSV format to run the analysis for the final project


# # Drop specified columns
# suicide_analysis <- suicide_analysis %>%
#   select(-c("Code", "COU", "PowerCode", "PowerCode Code"))
# 
# # Save the merged dataset as a CSV file
# # Define the path to the folder on your desktop
# desktop_path <- "/home/alex/Uni/Master_US/2_Semester/Class_Data_Wrangeling_EDA/Data_Wrangling_EDA/data/"
# 
# # Create the folder if it doesn't exist
# dir.create(desktop_path, showWarnings = FALSE)
# 
# # Save the merged dataset as a CSV file in the specified folder
# write.csv(suicide_analysis, file.path(desktop_path, "suicide_analysis_2.csv"), row.names = FALSE)
library(dplyr)
library(readr)
library(tidyverse)
library(ggplot2)
library(highcharter)
library(magrittr)

EDA

Data loading

data <- read_csv("https://raw.githubusercontent.com/Alexburk93/Data_Wrangling_EDA/main/data/suicide_analysis_2.csv")
New names:
Rows: 894 Columns: 19
── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (10): Country name, VAR, Variable, MEA, Measure, ISIC4...14, ISIC4...15, Unit Code, Unit, Quarter
dbl  (9): Year, Age-standardized suicide rate - Sex: both sexes, Life Ladder, Social support, Healthy life expecta...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data %>% sample_n(19)

Renaming variables and drop unwanted columns

# renaming the columns 
data <- data %>%
  rename(`Country_name` = `Country name`,
         `Suicide_Rate` = `Age-standardized suicide rate - Sex: both sexes`,
         `Life_ladder` = `Life Ladder`,
         `Social_support` = `Social support`,
         `Life_expectancy` = `Healthy life expectancy at birth`,
         `Freedom_choices` = `Freedom to make life choices`,
         `Corruption` = `Perceptions of corruption`)
data
# Drop redundant bankruptcy metadata columns
data <- select(data, -Variable, -VAR, -MEA, -`Unit Code`)  
data

Data exploration

head(data)
# change names
names(data) <- make.names(names(data))
# dimensions of the dataframe
nrow(data)
[1] 894
ncol(data)
[1] 15
dim(data)
[1] 894  15
# check the structure of the object
str(data)
tibble [894 × 15] (S3: tbl_df/tbl/data.frame)
 $ Country_name   : chr [1:894] "Australia" "Australia" "Australia" "Australia" ...
 $ Year           : num [1:894] 2011 2011 2011 2011 2011 ...
 $ Suicide_Rate   : num [1:894] 10.1 10.1 10.1 10.1 11 ...
 $ Life_ladder    : num [1:894] 7.41 7.41 7.41 7.41 7.19 ...
 $ Social_support : num [1:894] 0.967 0.967 0.967 0.967 0.954 ...
 $ Life_expectancy: num [1:894] 72.3 72.3 72.3 72.3 72.1 ...
 $ Freedom_choices: num [1:894] 0.945 0.945 0.945 0.945 0.935 ...
 $ Corruption     : num [1:894] 0.382 0.382 0.382 0.382 0.269 ...
 $ GDP            : num [1:894] 62610 62610 62610 62610 38388 ...
 $ Measure        : chr [1:894] "Index 2007=100" "Index 2007=100" "Index 2007=100" "Index 2007=100" ...
 $ ISIC4...14     : chr [1:894] "01_99C" "01_99C" "01_99C" "01_99C" ...
 $ ISIC4...15     : chr [1:894] "Grand total (corporations only)" "Grand total (corporations only)" "Grand total (corporations only)" "Grand total (corporations only)" ...
 $ Unit           : chr [1:894] "Index" "Index" "Index" "Index" ...
 $ Value          : num [1:894] 132 138 143 145 122 ...
 $ Quarter        : chr [1:894] "Q1" "Q2" "Q3" "Q4" ...
# Look at the first rows of columns 2, 4 to 6, 12 and 15
head(data[ , c(2, 4:6, 12, 15)])
# Look at the last rows of columns 1, 3 and 9
tail(data[ , c(1, 3, 9)])
table(data$Year)

2011 2012 2013 2014 2015 2016 2017 2018 2019 
 100  100  100  100  100  100  100   97   97 
data %>% 
  select(Country_name) %>% 
  unique() %>% 
  nrow()
[1] 19
unique(data$Country_name)
 [1] "Australia"      "New Zealand"    "United States"  "Spain"          "Netherlands"    "France"        
 [7] "Finland"        "Belgium"        "Japan"          "South Africa"   "Iceland"        "Norway"        
[13] "Sweden"         "Italy"          "Brazil"         "United Kingdom" "Germany"        "Canada"        
[19] "Denmark"       
unique(data$Year)
[1] 2011 2012 2013 2014 2015 2016 2017 2018 2019

Interactive maps

# Set highcharter options for tooltip decimals
options(highcharter.tooltip.valueDecimals = 2)

# Create highcharter map visualization
hc <- highchart() %>%
  hc_add_series_map(
    worldgeojson, data, value = "GDP", 
    joinBy = c('name', 'Country_name'),
    name = "GDP (current US$)"
  )  %>% 
  hc_colorAxis(stops = color_stops()) %>% 
  hc_title(text = "World Map") %>% 
  hc_subtitle(text = "GDP in current US$")

hc
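
The merged table holds many rows per country (several years and quarters), while a choropleth shows a single value per country. It may therefore be clearer to reduce the data to one row per country before passing it to the map. The sketch below shows two possible reductions on a made-up table; `toy`, `gdp_mean`, and `gdp_latest` are hypothetical names:

```r
library(dplyr)

# Toy stand-in for `data`: several rows per country (values are made up)
toy <- tibble(
  Country_name = c("Canada", "Canada", "Japan", "Japan"),
  Year         = c(2018, 2019, 2018, 2019),
  GDP          = c(46000, 46400, 39000, 40000)
)

# Option 1: average over all years
gdp_mean <- toy %>%
  group_by(Country_name) %>%
  summarise(GDP = mean(GDP, na.rm = TRUE), .groups = "drop")

# Option 2: keep only the most recent year per country
gdp_latest <- toy %>%
  group_by(Country_name) %>%
  slice_max(Year, n = 1, with_ties = FALSE) %>%
  ungroup()

gdp_mean
gdp_latest
```

Either reduced table could then take the place of `data` in the map call. Country names still have to match the map's `name` property exactly for the join to succeed.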
# Set highcharter options for tooltip decimals
options(highcharter.tooltip.valueDecimals = 2)

# Create map visualizations for each variable
hc_life_expectancy <- highchart() %>%
  hc_add_series_map(
    worldgeojson, data, 
    value = "Life_expectancy", 
    joinBy = c('name', 'Country_name'),
    name = "Life Expectancy"
  ) %>%
  hc_colorAxis(stops = color_stops()) %>%
  hc_title(text = "World Map") %>%
  hc_subtitle(text = "Life Expectancy")

hc_suicide_rates <- highchart() %>%
  hc_add_series_map(
    worldgeojson, data, 
    value = "Suicide_Rate", 
    joinBy = c('name', 'Country_name'),
    name = "Suicide Rates"
  ) %>%
  hc_colorAxis(stops = color_stops()) %>%
  hc_title(text = "World Map") %>%
  hc_subtitle(text = "Suicide Rate")

hc_corruption <- highchart() %>%
  hc_add_series_map(
    worldgeojson, data, 
    value = "Corruption", 
    joinBy = c('name', 'Country_name'),
    name = "Corruption"
  ) %>%
  hc_colorAxis(stops = color_stops()) %>%
  hc_title(text = "World Map") %>%
  hc_subtitle(text = "Corruption")

# Display the map visualizations
list(hc_life_expectancy, hc_suicide_rates, hc_corruption)

Analysis

AVG GDP over years

Calculation

avg_gdp_per_year <- data %>% 
  group_by (`Year`) %>% 
  summarise(avg_gdp = mean(`GDP`))

avg_gdp_per_year

Plot

ggplot(avg_gdp_per_year, aes(x = Year, y = avg_gdp)) +
  geom_line(color = "blue") +  
  labs(title = "Average GDP Over Time (19 Countries)",
       x = "Year",
       y = "GDP") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(avg_gdp_per_year$Year), max(avg_gdp_per_year$Year), by = 1))

AVG Happiness over years

Calculation

avg_happiness_per_year <- data %>% 
  group_by (`Year`) %>% 
  summarise(avg_happinnes = mean(`Life_ladder`, na.rm = T))

avg_happiness_per_year

Plot

ggplot(avg_happiness_per_year, aes(x = Year, y = avg_happinnes)) +
  geom_line(color = "blue") +  
  labs(title = "Average Happiness Over Time worldwide",
       x = "Year",
       y = "Happiness") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(avg_happiness_per_year$Year), max(avg_happiness_per_year$Year), by = 1))

AVG Suicide Rates over years

Calculation

avg_Suicide_Rate_per_year <- data %>% 
  group_by (`Year`) %>% 
  summarise(avg_Suicide_Rate = mean(`Suicide_Rate`, na.rm = T))

avg_Suicide_Rate_per_year

Plot

ggplot(avg_Suicide_Rate_per_year, aes(x = Year, y = avg_Suicide_Rate)) +
  geom_line(color = "blue") +  
  labs(title = "Average Suicide Rate Over Time worldwide",
       x = "Year",
       y = "Suicide Rate") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(avg_Suicide_Rate_per_year$Year), max(avg_Suicide_Rate_per_year$Year), by = 1))

AVG Bankruptcies over years

Calculation

avg_Bankruptcies_per_year <- data %>% 
  group_by (`Year`) %>% 
  summarise(avg_Bankruptcies = mean(`Value`, na.rm = T))

avg_Bankruptcies_per_year

Plot

ggplot(avg_Bankruptcies_per_year, aes(x = Year, y = avg_Bankruptcies)) +
  geom_line(color = "blue") +  
  labs(title = "Average Bankruptcies Over Time worldwide",
       x = "Year",
       y = "Bankruptcies") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(avg_Bankruptcies_per_year$Year), max(avg_Bankruptcies_per_year$Year), by = 1))
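
The four yearly averages above can also be computed in a single grouped summary. The following is a minimal sketch, assuming a merged data frame with the same column names as `data`; the `toy` table and its values are invented purely for illustration, and `yearly_means` is a hypothetical name.

```r
library(dplyr)

# Hypothetical stand-in for the merged `data` frame (values are made up)
toy <- tibble(
  Year         = c(2011, 2011, 2012, 2012),
  GDP          = c(40000, 10000, 42000, NA),
  Life_ladder  = c(7.0, 5.0, 7.2, 5.2),
  Suicide_Rate = c(10, 20, 9, 21),
  Value        = c(100, 50, 110, 60)
)

# One grouped summary instead of four separate ones;
# na.rm = TRUE skips missing values in every column
yearly_means <- toy %>%
  group_by(Year) %>%
  summarise(
    across(c(GDP, Life_ladder, Suicide_Rate, Value),
           ~ mean(.x, na.rm = TRUE),
           .names = "avg_{.col}"),
    .groups = "drop"
  )

yearly_means
```

Keeping all averages in one table also makes it easier to plot or correlate the series later, since they already share the `Year` column.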

Plot Average GDP and Average Suicide Rate over years

# Finding the ratio for scaling the second axis
ratio <- max(avg_gdp_per_year$avg_gpd) / max(avg_Suicide_Rate_per_year$avg_Suicide_Rate)

# Creating the base plot
ggplot() +
  # Line for average GDP
  geom_line(data = avg_gdp_per_year, aes(x = Year, y = avg_gpd), size = 0.5) +
  # Line for average suicide rate, rescaled onto the GDP axis
  geom_line(data = avg_Suicide_Rate_per_year, aes(x = Year, y = avg_Suicide_Rate * ratio), color = "red", size = 0.5) +
  # Enhancing the plot
  labs(title = "Average GDP and Suicide Rate Over Time",
       x = "Year",
       y = "Average GDP") +
  scale_y_continuous(sec.axis = sec_axis(~ . / ratio, name = "Average Suicide Rate")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(c(avg_gdp_per_year$Year, avg_Suicide_Rate_per_year$Year)), 
                                  max(c(avg_gdp_per_year$Year, avg_Suicide_Rate_per_year$Year)), by = 1))

Plot Average Happiness and Average Suicide Rate over years

# Finding the ratio for scaling the second axis
ratio <- max(avg_happiness_per_year$avg_happinnes) / max(avg_Suicide_Rate_per_year$avg_Suicide_Rate)

# Creating the base plot
ggplot() +
  # Line for average happiness
  geom_line(data = avg_happiness_per_year, aes(x = Year, y = avg_happinnes), size = 0.5) +
  # Line for average suicide rate, rescaled onto the happiness axis
  geom_line(data = avg_Suicide_Rate_per_year, aes(x = Year, y = avg_Suicide_Rate * ratio), color = "red", size = 0.5) +
  # Enhancing the plot
  labs(title = "Average Happiness and Suicide Rate Over Time",
       x = "Year",
       y = "Average Happiness") +
  scale_y_continuous(sec.axis = sec_axis(~ . / ratio, name = "Average Suicide Rate")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(c(avg_happiness_per_year$Year, avg_Suicide_Rate_per_year$Year)), 
                                  max(c(avg_happiness_per_year$Year, avg_Suicide_Rate_per_year$Year)), by = 1))

# Finding the ratio for scaling the second axis
ratio <- max(avg_happiness_per_year$avg_happinnes) / max(avg_Suicide_Rate_per_year$avg_Suicide_Rate)

# Creating the base plot
ggplot() +
  # Adding the line plot for Average Happiness
  geom_line(data = avg_happiness_per_year, aes(x = Year, y = avg_happinnes), size = 0.5) +
  # Adding the line plot for Suicide Rate adjusted by the ratio
  geom_line(data = avg_Suicide_Rate_per_year, aes(x = Year, y = avg_Suicide_Rate * ratio), color = "red", size = 0.5) +
  # Setting up the titles and labels
  labs(title = "Average Happiness and Suicide Rate Over Time",
       x = "Year",
       y = "Average Happiness",
       subtitle = "Suicide rates are scaled to compare against happiness") +
  # Primary axis for Happiness, secondary axis for Suicide Rate (inversed scaling)
  scale_y_continuous(name = "Average Happiness",
                     sec.axis = sec_axis(~ . / ratio, name = "Average Suicide Rate")) +
  # Minimalist theme with angled x-axis texts
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  # Set x-axis breaks
  scale_x_continuous(breaks = seq(min(c(avg_happiness_per_year$Year, avg_Suicide_Rate_per_year$Year)), 
                                  max(c(avg_happiness_per_year$Year, avg_Suicide_Rate_per_year$Year)), by = 1))

Plot Average Bankruptcies and Average Suicide Rate over years

# Finding the ratio for scaling the second axis
ratio <- max(avg_Bankruptcies_per_year$avg_Bankruptcies) / max(avg_Suicide_Rate_per_year$avg_Suicide_Rate)

# Creating the base plot
ggplot() +
  # Line for average bankruptcies
  geom_line(data = avg_Bankruptcies_per_year, aes(x = Year, y = avg_Bankruptcies), size = 0.5) +
  # Line for average suicide rate, rescaled onto the bankruptcies axis
  geom_line(data = avg_Suicide_Rate_per_year, aes(x = Year, y = avg_Suicide_Rate * ratio), color = "red", size = 0.5) +
  # Enhancing the plot
  labs(title = "Average Bankruptcies and Suicide Rate Over Time",
       x = "Year",
       y = "Average Bankruptcies") +
  scale_y_continuous(sec.axis = sec_axis(~ . / ratio, name = "Average Suicide Rate")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(c(avg_Bankruptcies_per_year$Year, avg_Suicide_Rate_per_year$Year)), 
                                  max(c(avg_Bankruptcies_per_year$Year, avg_Suicide_Rate_per_year$Year)), by = 1))
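
The dual-axis plots above all follow the same recipe: scale the second series by `ratio` so it fits the primary axis, then divide by `ratio` again in `sec_axis()` so the right-hand axis shows the original units. A sketch of how this could be wrapped in one function is given below. The helper `plot_dual_axis()` and the toy data frames are hypothetical and not part of the original notebook.

```r
library(ggplot2)

# Hypothetical helper: draws two yearly series on one plot, rescaling the
# second series so that it shares the primary y-axis.
plot_dual_axis <- function(primary, secondary, primary_col, secondary_col,
                           primary_label, secondary_label) {
  # Scale factor that maps the secondary series onto the primary axis range
  ratio <- max(primary[[primary_col]], na.rm = TRUE) /
    max(secondary[[secondary_col]], na.rm = TRUE)
  years <- range(c(primary$Year, secondary$Year))

  ggplot() +
    geom_line(data = primary,
              aes(x = Year, y = .data[[primary_col]])) +
    geom_line(data = secondary,
              aes(x = Year, y = .data[[secondary_col]] * ratio),
              color = "red") +
    scale_y_continuous(
      name = primary_label,
      # Undo the scaling so the right-hand axis shows the original units
      sec.axis = sec_axis(~ . / ratio, name = secondary_label)
    ) +
    scale_x_continuous(breaks = seq(years[1], years[2], by = 1)) +
    labs(title = paste(primary_label, "and", secondary_label, "Over Time"),
         x = "Year") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
}

# Made-up yearly averages, for illustration only
gdp_toy     <- data.frame(Year = 2011:2013, avg_gdp = c(40000, 41000, 43000))
suicide_toy <- data.frame(Year = 2011:2013, avg_suicide = c(10.0, 9.5, 9.0))

p <- plot_dual_axis(gdp_toy, suicide_toy, "avg_gdp", "avg_suicide",
                    "Average GDP", "Average Suicide Rate")
p
```

Because the scaling is arbitrary, such plots only show whether two series move together over time; apparent crossings or gaps between the lines carry no meaning.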

Comparisons

Two happiest and two unhappiest countries vs average suicide rate

avg_Suicide_Rate_per_country = data %>%
  group_by(Country_name) %>%
  summarise(avg_suicide_rate = mean(Suicide_Rate, na.rm = TRUE)) %>%
  arrange(avg_suicide_rate) %>% 
  mutate(Row_Number = row_number())

avg_Suicide_Rate_per_country

avg_happiness_per_country <- data %>%
  group_by(Country_name) %>%
  summarise(avg_happiness = mean(Life_ladder, na.rm = TRUE)) %>%
  arrange(desc(avg_happiness))

least_happy =  tail(avg_happiness_per_country, 2)
most_happy = head(avg_happiness_per_country, 2)


avg_Suicide_Rate_per_country %>% 
  filter(Country_name %in% least_happy$Country_name)

# Interpretation: Japan and South Africa are the two least happy countries in the sample, and both also have a high suicide rate

avg_Suicide_Rate_per_country %>% 
  filter(Country_name %in% most_happy$Country_name)

# Interpretation: Finland is the second-happiest country but still ranks 16th of 19 by suicide rate (rank 1 = lowest rate)

Two wealthiest and two poorest countries vs average suicide rate

avg_gdp_per_country <- data %>% 
  group_by (`Country_name`) %>% 
  summarise(avg_gpd = mean(`GDP`)) %>% 
  arrange(desc(avg_gpd))

least_gdp =  tail(avg_gdp_per_country, 2)
most_gdp = head(avg_gdp_per_country, 2)

avg_Suicide_Rate_per_country %>% 
  filter(Country_name %in% least_gdp$Country_name)

# Interpretation: Brazil has a low GDP per capita but still ranks low for suicide rate. South Africa has a low GDP per capita and also a high suicide rate

avg_Suicide_Rate_per_country %>% 
  filter(Country_name %in% most_gdp$Country_name)

# Interpretation: The two countries with the highest GDP per capita sit in the middle of the suicide ranking

Countries with the two most and two fewest bankruptcies vs average suicide rate

# Average bankruptcies per country (a separate name avoids overwriting the per-year table)
avg_bankruptcies_per_country <- data %>% 
  group_by (`Country_name`) %>% 
  summarise(avg_bankruptcies = mean(`Value`, na.rm = T)) %>% 
  arrange(desc(avg_bankruptcies))

least_bank =  tail(avg_bankruptcies_per_country, 2)
most_bank = head(avg_bankruptcies_per_country, 2)

avg_Suicide_Rate_per_country %>% 
  filter(Country_name %in% least_bank$Country_name)

# Interpretation: This comparison shows no apparent relationship between bankruptcies and suicide rates

avg_Suicide_Rate_per_country %>% 
  filter(Country_name %in% most_bank$Country_name)

# Interpretation: This comparison shows no apparent relationship between bankruptcies and suicide rates
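
The comparisons above look only at the two countries at each end of a ranking. A sketch of how all countries could be placed side by side with their ranks follows. The `toy` table, its values, and the name `country_ranks` are invented for illustration; only the column names are taken from the notebook.

```r
library(dplyr)

# Hypothetical stand-in for the merged `data` frame (values are made up)
toy <- tibble(
  Country_name = rep(c("A", "B", "C"), each = 2),
  Life_ladder  = c(7.5, 7.7, 6.0, 6.2, 4.9, 5.1),
  Suicide_Rate = c(13, 15, 8, 10, 20, 22)
)

country_ranks <- toy %>%
  group_by(Country_name) %>%
  summarise(
    avg_happiness    = mean(Life_ladder, na.rm = TRUE),
    avg_suicide_rate = mean(Suicide_Rate, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(
    happiness_rank = min_rank(desc(avg_happiness)),  # 1 = happiest
    suicide_rank   = min_rank(avg_suicide_rate)      # 1 = lowest rate
  ) %>%
  arrange(happiness_rank)

country_ranks
```

With both ranks in one table, it is possible to see at a glance whether the orderings agree across the whole sample rather than only at the extremes.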

In-depth analysis: Germany

Data preparation for Germany

# Prepare data for only Germany
germany_data = data %>% 
  filter(Country_name == "Germany")

Plot GDP Germany

# Plot Germany GDP over Years
avg_gdp_year_germany = germany_data %>% 
  group_by(Year) %>% 
  summarise(avg_gdp = mean(GDP)) 


ggplot(avg_gdp_year_germany, aes(x = Year, y = avg_gdp)) +
  geom_line(color = "blue") +  
  labs(title = "Average GDP Over Time - Germany",
       x = "Year",
       y = "GDP") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(avg_gdp_year_germany$Year), max(avg_gdp_year_germany$Year), by = 1))

Plot Suicide Rate Germany

avg_suicide_year_germany = germany_data %>% 
  group_by(Year) %>% 
  summarise(avg_suicide = mean(Suicide_Rate)) 


ggplot(avg_suicide_year_germany, aes(x = Year, y = avg_suicide)) +
  geom_line(color = "blue") +  
  labs(title = "Average Suicide Rate Over Time - Germany",
       x = "Year",
       y = "Suicide Rate") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(avg_suicide_year_germany$Year), max(avg_suicide_year_germany$Year), by = 1))

Plot Bankruptcies Rate Germany

avg_bank_year_germany = germany_data %>% 
  group_by(Year) %>% 
  summarise(avg_bank = mean(Value)) 


ggplot(avg_bank_year_germany, aes(x = Year, y = avg_bank)) +
  geom_line(color = "blue") +  
  labs(title = "Average bankruptcies Over Time - Germany",
       x = "Year",
       y = "Bankruptcies") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(avg_bank_year_germany$Year), max(avg_bank_year_germany$Year), by = 1))

Plot Happiness Rate Germany

avg_happiness_year_germany = germany_data %>% 
  group_by(Year) %>% 
  summarise(avg_happy = mean(Life_ladder)) 


ggplot(avg_happiness_year_germany, aes(x = Year, y = avg_happy)) +
  geom_line(color = "blue") +  
  labs(title = "Average Happiness Over Time - Germany",
       x = "Year",
       y = "Happiness") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(avg_happiness_year_germany$Year), max(avg_happiness_year_germany$Year), by = 1))

Plot Average GDP and Average Suicide Rate over years - Germany

avg_suicide_year_germany$avg_suicide
[1] 9.25 8.98 9.22 9.19 9.00 8.82 8.34 8.46 8.27
# Finding the ratio for scaling the second axis
ratio <- max(avg_gdp_year_germany$avg_gdp) / max(avg_suicide_year_germany$avg_suicide)

# Creating the base plot
ggplot() +
  # Line for Germany's average GDP
  geom_line(data = avg_gdp_year_germany, aes(x = Year, y = avg_gdp), size = 0.5) +
  # Line for Germany's average suicide rate, rescaled onto the GDP axis
  geom_line(data = avg_suicide_year_germany, aes(x = Year, y = avg_suicide * ratio), color = "red", size = 0.5) +
  # Enhancing the plot
  labs(title = "Average GDP and Suicide Rate Over Time - Germany",
       x = "Year",
       y = "Average GDP") +
  scale_y_continuous(sec.axis = sec_axis(~ . / ratio, name = "Average Suicide Rate")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(c(avg_gdp_year_germany$Year, avg_suicide_year_germany$Year)), 
                                  max(c(avg_gdp_year_germany$Year, avg_suicide_year_germany$Year)), by = 1))

Plot Average Happiness and Average Suicide Rate over years - Germany

# Finding the ratio for scaling the second axis
ratio <- max(avg_happiness_year_germany$avg_happy) / max(avg_suicide_year_germany$avg_suicide)

# Creating the base plot
ggplot() +
  # Line for Germany's average happiness
  geom_line(data = avg_happiness_year_germany, aes(x = Year, y = avg_happy), size = 0.5) +
  # Line for Germany's average suicide rate, rescaled onto the happiness axis
  geom_line(data = avg_suicide_year_germany, aes(x = Year, y = avg_suicide * ratio), color = "red", size = 0.5) +
  # Enhancing the plot
  labs(title = "Average Happiness and Suicide Rate Over Time - Germany",
       x = "Year",
       y = "Average Happiness") +
  scale_y_continuous(sec.axis = sec_axis(~ . / ratio, name = "Average Suicide Rate")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(c(avg_happiness_year_germany$Year, avg_suicide_year_germany$Year)), 
                                  max(c(avg_happiness_year_germany$Year, avg_suicide_year_germany$Year)), by = 1))

Plot Average Bankruptcies and Average Suicide Rate over years - Germany

# Finding the ratio for scaling the second axis
ratio <- max(avg_bank_year_germany$avg_bank) / max(avg_suicide_year_germany$avg_suicide)

# Creating the base plot
ggplot() +
  # Line for Germany's average bankruptcies
  geom_line(data = avg_bank_year_germany, aes(x = Year, y = avg_bank), size = 0.5) +
  # Line for Germany's average suicide rate, rescaled onto the bankruptcies axis
  geom_line(data = avg_suicide_year_germany, aes(x = Year, y = avg_suicide * ratio), color = "red", size = 0.5) +
  # Enhancing the plot
  labs(title = "Average Bankruptcies and Suicide Rate Over Time - Germany",
       x = "Year",
       y = "Average Bankruptcies") +
  scale_y_continuous(sec.axis = sec_axis(~ . / ratio, name = "Average Suicide Rate")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(c(avg_bank_year_germany$Year, avg_suicide_year_germany$Year)), 
                                  max(c(avg_bank_year_germany$Year, avg_suicide_year_germany$Year)), by = 1))

In-depth analysis: South Africa

Data preparation for South Africa

# Prepare data for only South Africa
SA_data = data %>% 
  filter(Country_name == "South Africa")

Plot GDP SA

# Plot SA GDP over Years
avg_gdp_year_SA = SA_data %>% 
  group_by(Year) %>% 
  summarise(avg_gdp = mean(GDP)) 



ggplot(avg_gdp_year_SA, aes(x = Year, y = avg_gdp)) +
  geom_line(color = "blue") +  
  labs(title = "Average GDP Over Time - SA",
       x = "Year",
       y = "GDP") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(avg_gdp_year_SA$Year), max(avg_gdp_year_SA$Year), by = 1))

Plot Suicide Rate SA

avg_suicide_year_SA = SA_data %>% 
  group_by(Year) %>% 
  summarise(avg_suicide = mean(Suicide_Rate)) 


ggplot(avg_suicide_year_SA, aes(x = Year, y = avg_suicide)) +
  geom_line(color = "blue") +  
  labs(title = "Average Suicide Rate Over Time - SA",
       x = "Year",
       y = "Suicide Rate") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(avg_suicide_year_SA$Year), max(avg_suicide_year_SA$Year), by = 1))

Plot Happiness Rate SA

avg_happiness_year_SA = SA_data %>% 
  group_by(Year) %>% 
  summarise(avg_happy = mean(Life_ladder)) 


ggplot(avg_happiness_year_SA, aes(x = Year, y = avg_happy)) +
  geom_line(color = "blue") +  
  labs(title = "Average Happiness Over Time - SA",
       x = "Year",
       y = "Happiness") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(avg_happiness_year_SA$Year), max(avg_happiness_year_SA$Year), by = 1))

Plot Bankruptcies Rate SA

avg_bank_year_sa = SA_data %>% 
  group_by(Year) %>% 
  summarise(avg_bank = mean(Value)) 


ggplot(avg_bank_year_sa, aes(x = Year, y = avg_bank)) +
  geom_line(color = "blue") +  
  labs(title = "Average bankruptcies Over Time - SA",
       x = "Year",
       y = "Bankruptcies") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(avg_bank_year_sa$Year), max(avg_bank_year_sa$Year), by = 1))

Plot Average GDP and Average Suicide Rate over years - SA

# Finding the ratio for scaling the second axis
ratio <- max(avg_gdp_year_SA$avg_gdp) / max(avg_suicide_year_SA$avg_suicide)

# Creating the base plot
ggplot() +
  # Line for South Africa's average GDP
  geom_line(data = avg_gdp_year_SA, aes(x = Year, y = avg_gdp), size = 0.5) +
  # Line for South Africa's average suicide rate, rescaled onto the GDP axis
  geom_line(data = avg_suicide_year_SA, aes(x = Year, y = avg_suicide * ratio), color = "red", size = 0.5) +
  # Enhancing the plot
  labs(title = "Average GDP and Suicide Rate Over Time - SA",
       x = "Year",
       y = "Average GDP") +
  scale_y_continuous(sec.axis = sec_axis(~ . / ratio, name = "Average Suicide Rate")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(c(avg_gdp_year_SA$Year, avg_suicide_year_SA$Year)), 
                                  max(c(avg_gdp_year_SA$Year, avg_suicide_year_SA$Year)), by = 1))

Plot Average Bankruptcies and Average Suicide Rate over years - SA

# Finding the ratio for scaling the second axis
ratio <- max(avg_bank_year_sa$avg_bank) / max(avg_suicide_year_SA$avg_suicide)

# Creating the base plot
ggplot() +
  # Line for South Africa's average bankruptcies
  geom_line(data = avg_bank_year_sa, aes(x = Year, y = avg_bank), size = 0.5) +
  # Line for South Africa's average suicide rate, rescaled onto the bankruptcies axis
  geom_line(data = avg_suicide_year_SA, aes(x = Year, y = avg_suicide * ratio), color = "red", size = 0.5) +
  # Enhancing the plot
  labs(title = "Average Bankruptcies and Suicide Rate Over Time - SA",
       x = "Year",
       y = "Average Bankruptcies") +
  scale_y_continuous(sec.axis = sec_axis(~ . / ratio, name = "Average Suicide Rate")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(c(avg_bank_year_sa$Year, avg_suicide_year_SA$Year)), 
                                  max(c(avg_bank_year_sa$Year, avg_suicide_year_SA$Year)), by = 1))

Plot Average Happiness and Average Suicide Rate over years - SA

# Finding the ratio for scaling the second axis
ratio <- max(avg_happiness_year_SA$avg_happy) / max(avg_suicide_year_SA$avg_suicide)

# Creating the base plot
ggplot() +
  # Line for South Africa's average happiness
  geom_line(data = avg_happiness_year_SA, aes(x = Year, y = avg_happy), size = 0.5) +
  # Line for South Africa's average suicide rate, rescaled onto the happiness axis
  geom_line(data = avg_suicide_year_SA, aes(x = Year, y = avg_suicide * ratio), color = "red", size = 0.5) +
  # Enhancing the plot
  labs(title = "Average Happiness and Suicide Rate Over Time - SA",
       x = "Year",
       y = "Average Happiness") +
  scale_y_continuous(sec.axis = sec_axis(~ . / ratio, name = "Average Suicide Rate")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(c(avg_happiness_year_SA$Year, avg_suicide_year_SA$Year)), 
                                  max(c(avg_happiness_year_SA$Year, avg_suicide_year_SA$Year)), by = 1))

Conclusion and Summary

This project gave us the opportunity to explore an unfamiliar topic and to apply methods learned in class to a real-world problem. We worked with four datasets: Suicide Rate, GDP per Capita, Bankruptcies, and Happiness Index, covering the years 2011 to 2019. We first analyzed each dataset independently to understand its composition and carried out a preliminary exploratory data analysis, which included calculating and plotting the average annual values of each dataset across all countries.

Following this foundational analysis, we moved on to data wrangling and merged the four datasets. The combined dataset covers 19 countries and formed the basis for further analysis. In the analytical phase, we looked for correlations between the variables across these 19 countries but did not find any significant ones.

To gain deeper insights, we narrowed our focus to two specific countries: Germany and South Africa. For each country, we conducted a detailed exploration of their data, plotting trends over the years and searching for any correlations between the variables within each national context. Our analysis did not reveal any significant correlations.
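
One way to put a number on such a relationship is a correlation test on the yearly averages. The sketch below uses base R's `cor.test()`. The `yearly` data frame and all of its values are invented for illustration and are not results from this project.

```r
# Hypothetical yearly averages (made-up numbers, for illustration only)
yearly <- data.frame(
  Year        = 2011:2019,
  avg_gdp     = c(41, 42, 44, 45, 43, 44, 46, 48, 47),
  avg_suicide = c(12.1, 11.8, 12.0, 11.7, 11.9, 11.5, 11.6, 11.2, 11.3)
)

# Pearson correlation together with a significance test
result <- cor.test(yearly$avg_gdp, yearly$avg_suicide, method = "pearson")

result$estimate  # correlation coefficient r
result$p.value   # p-value for the null hypothesis r = 0
```

With only nine yearly observations per series, such a test has little statistical power, so a non-significant result is weak evidence that no relationship exists.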

Possible Improvements

Given the absence of correlations in our initial analyses, we see a need to widen the dataset. Options for expanding it include adding variables such as alcohol and drug use, unemployment rates, and sunshine hours in the countries studied. In addition, moving from annual to monthly data might reveal trends and correlations that are not visible in the yearly data.

Furthermore, to enhance the quality and depth of our analysis, focusing on more granular details could provide significant insights, particularly by examining various socio-economic and political factors that influence country-specific behaviors. Incorporating additional variables such as political stability, social conflicts, and specific cultural constructs could enrich our understanding of the correlations or lack thereof in the data. These factors often have profound impacts on economic conditions, happiness indices, and social issues like suicide rates and bankruptcy, providing a more nuanced and comprehensive framework for analysis.

If given the opportunity to revisit this project from the beginning, we would integrate these broader socio-political variables from the start, allowing for a more thorough initial data collection phase. This approach would enable us to capture a wider spectrum of influences, potentially revealing hidden patterns and correlations that were not evident in our previous analysis. Moreover, employing advanced statistical methods or machine learning techniques could further aid in identifying complex interactions between variables.

Continuing this project, our next steps would involve expanding our dataset to include these additional socio-political factors and applying more sophisticated analytical techniques. This could involve time-series analysis for trend detection or cluster analysis to identify similar behavioral patterns across different countries. By doing so, we aim to build a richer analytical model that can more accurately reflect the intricate realities influencing these critical societal indicators.

---
title: "Presentation - Don't commit Suicide"
author: "Marckenrold Cadet & Alexander Burkhart"
date: "`r Sys.Date()`"
output: 
   html_notebook:
       toc: true 
       toc_float: true
       toc_depth: 2
       theme: united
       highlight: tango
---
# Introduction

Suicide is the deliberate act of ending one’s own life, often stemming from various mental disorders such as depression, bipolar disorder, autism, schizophrenia, and personality disorders, as well as external stressors like financial struggles, academic pressures, relationship issues, or experiences of harassment and bullying. Factors like substance abuse, including alcoholism and benzodiazepine use, also contribute to this tragic outcome. Previous suicide attempts significantly elevate the risk of future attempts.
Efforts to prevent suicide involve a multifaceted approach, including restricting access to common methods like firearms, drugs, and poisons, addressing mental health issues and substance abuse, responsible media reporting on suicides, and fostering better economic conditions. Despite the widespread availability of crisis hotlines, their effectiveness remains inadequately researched.
The prevalence and methods of suicide vary across countries, often influenced by the accessibility of lethal means. Hanging, pesticide ingestion, poisoning, and firearms are among the most commonly used methods. Globally, suicides claim over 700,000 lives annually, ranking suicide as the 10th leading cause of death worldwide.
Approximately 1.5% of people die by suicide, translating to roughly 12 per 100,000 individuals each year. Men are more likely to die by suicide than women, with rates ranging from 1.5 times higher in developing countries to 3.5 times higher in developed ones. Financial strain often exacerbates the risk of suicide. [1]

# Related Work

## Title: A Study for Effects of Economic Growth Rate and Unemployment Rate to Suicide Rate in Korea

Author: Jong Soon Park, June Young Lee, Soon Duck Kim

Objectives: We investigated the effects of the economic growth and unemployment rates on the suicide rate in Korea, between 1983 and 2000, using a time-series regression model. The purpose of this study was to model and test the magnitude of the rate of suicide, with the Korean unemployment rate and GDP. [2]

Conclusion: It was found that the suicide rate was closely related to the National’s economic status of Korea, which is similar to the results found in studies in other countries. We expected, therefore, that this study could be used as the basis for further suicide-related studies. [2]

## Title: Ten-year evolution of suicide rates and economic indicators in large Brazilian urban centers

Author: Asevedo Elson, Ziebold Carolina, Diniz Elton, Gadelha Ary, Mari Jair

Purpose of review: This was a retrospective ecological study to examine the relationship between suicide rates and economic indicators in large Brazilian urban centers. Data on macroeconomic indicators (GDP and unemployment rates) and suicide rates of the largest Brazilian cities were collected from January 2006 to December 2015. [3]

Summary: The effect of economic indicators was heterogeneous among the centers, but, overall, the variation in suicide rates was inversely related to unemployment and did not show a significant relationship with GDP. These findings indicate a more complex link between economics and suicide whenever looking at local regional indicators. Further research should focus on possible intervening factors, what may inform better preventive interventions. [3]

## Title: The association between nation-level social and economic indices and suicide rates: A pilot study

Author: Ravi Philip Rajkumar

Introduction: Suicide is one of the leading causes of premature mortality worldwide. An analysis of global suicide data for the period 1990–2016, covering 195 countries, found that suicide was among the ten leading causes of years of life lost in Europe, the Americas and the Asia-Pacific region (Naghavi, 2019). Because of this, suicide prevention has been accorded one of the highest priorities in national and international public health plans, such as the World Health Organization’s Comprehensive Mental Health Action Plan (World Health Organization, 2021). The act of suicide is a complex behavior which results from the interaction between an innate diathesis and external stressors (van Heeringen, 2012). [4]

Conclusions: Social factors are important determinants of suicide. Despite certain limitations, the results of the current study suggest that sustainable development may play a role in the mitigation of suicide risk. Economic inequality may contribute to variations in suicide risk, particularly in men. Moreover, the strength and direction of the associations between socioeconomic factors and suicide varies across income groups, highlighting the need for an in-depth understanding of each country’s social, cultural and economic profile when planning large-scale social interventions aimed at suicide prevention. [4]

## Title: Is Happiness the main cause for the rising suicide rate in the world?

Author: Bong Jin Haw, Chong Yock Loon, Tai Hong Ting, Teoh Kah Chun, Wong Chin Hing

Introduction: This study investigates the impact of happiness on suicide rate. Besides the impact of happiness, employment, gross domestic product, education, health expenditure and population on the suicide rate is also studied. The study divides the data into three sets including 31 developed and developing countries, 26 developed countries, and 5 developing countries. A significant relationship between happiness and the suicide rate is found in developing countries. [5]

Conclusions: Surprisingly, based on the result it can be concluded that the happiness in developing countries will significantly impact and influence the suicide rate indeed. It is important to realize that the happiness is negatively affect the suicide rate in developing countries which means that increase in the happiness will subsequently lower down the suicide rate. Moreover, other economic factors such as employment, gross domestic product, government health expenditure, education and population are also included in this study to carry out their impact towards the suicide rate. [5]

## Sources
[1] https://www.iasp.info/wspd/references/ <br>
[2] https://www.jpmph.org/journal/view.php?number=50 (visited 03/24/2024) <br>
[3] https://journals.lww.com/co-psychiatry/abstract/2018/05000/ten_year_evolution_of_suicide_rates_and_economic.15.aspx
(visited 03/22/2024) <br>
[4] https://www.frontiersin.org/articles/10.3389/fsoc.2023.1123284/full <br>
[5] http://eprints.utar.edu.my/2869/1/fyp_BF_2018_BJH_-_1403312.pdf <br>


# Research Question

Suicide is not confined to high-income countries; it is a global issue affecting all regions. Surprisingly, over 77% of suicides occur in low- and middle-income countries. However, high-income countries exhibit the highest age-standardized suicide rates. In ongoing research across 19 countries, we explore correlations between GDP, bankruptcy rates, happiness levels, and suicide rates. Understanding these intricate relationships is pivotal for policymakers, researchers, and stakeholders to devise effective interventions promoting mental well-being and economic resilience. Leveraging multidimensional datasets and advanced analytical methods, our project aims to illuminate these connections for the betterment of societies worldwide.

# Data Wrangling

```{r}
# Load libraries
library(dplyr)
library(readr)
library(tidyr)
library(lubridate)
library(ggplot2)
library(stringr)
library(readxl)
library(httr)
```

## Exploration of the Bankruptcies Dataset
```{r}

# Load dataset
bankruptcies <- read_csv("https://raw.githubusercontent.com/Alexburk93/Data_Wrangling_EDA/main/data/raw_data/Bankruptcies_2011-2020.csv")

# Display structure of dataset
str(bankruptcies)

# Show the first few rows 
head(bankruptcies)

# Summary statistics of numerical attributes
summary(bankruptcies)

# Summary statistics of categorical attribute
table(bankruptcies$Country)

# Check for missing values
sum(is.na(bankruptcies))

# Remove the NA columns and the duplicate TIME column
data <- select(bankruptcies, -`Flag Codes`, -Flags, -`Reference Period Code`, -`Reference Period`, -TIME)
data %>% 
  sample_n(10)

# Display the structure of the modified dataset
str(data)

# Check the dimension of data
dim(data)

```

For this project, we focus only on data from 2011 to 2020 and, to keep the workload manageable, analyze 19 countries.
```{r}
# Create new columns 'Quarter' and 'Year' from 'Time'
new_data <- mutate(data,
                   Quarter = str_sub(Time, 1, 2), # Extract Quarter
                   Year = as.numeric(str_sub(Time, 4))) # Extract Year and convert to numeric

# remove the original 'Time' column
new_data <- select(new_data, -Time)  
new_data

# Get date range
date_range <- range(new_data$Year)

# Count unique countries
n_countries <- length(unique(new_data$Country))

# Print the results
n_countries
date_range

# Only need data from 2011-2020, excluding 2021
bankruptcy_data <- filter(new_data, Year != 2021)
bankruptcy_data

# Get date range of new updated dataset
date_range <- range(bankruptcy_data$Year)

# Extract start and end years from the range
start_year <- date_range[1]
end_year <- date_range[2]

# Count unique countries
n_countries <- length(unique(bankruptcy_data$Country))

# Print the results
cat("Number of countries: ", n_countries, "\n")
cat("Date range from: ", start_year, " to ", end_year, "\n")
```
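
The `Quarter`/`Year` split above relies on fixed character positions. As a minimal sketch, assuming `Time` values look like `"Q1-2011"` (the sample values below are made up for illustration), the extraction and the year filter behave as follows:

```r
library(dplyr)
library(stringr)

# Hypothetical sample; the "Q<quarter>-<year>" layout is an assumption
toy_time <- data.frame(Time = c("Q1-2011", "Q4-2020", "Q2-2021"))

toy_parsed <- toy_time %>%
  mutate(
    Quarter = str_sub(Time, 1, 2),           # characters 1 to 2, e.g. "Q1"
    Year    = as.numeric(str_sub(Time, 4))   # character 4 to the end, e.g. 2011
  ) %>%
  filter(between(Year, 2011, 2020))          # keep only the study window

toy_parsed
```

Filtering with `between()` states the intended window explicitly, whereas `Year != 2021` only works as long as 2021 is the single year outside the range.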

## Exploration of the GDP Dataset and Analysis of Findings
```{r}
# Load dataset 
my_data <- read_csv("https://raw.githubusercontent.com/Alexburk93/Data_Wrangling_EDA/main/data/raw_data/GDP_Data/GDP_capita_1960_2022.csv")
my_data
```

The years are stored in the column headers, so `pivot_longer()` is used to reshape them into two columns, `Year` and `GDP`.
```{r}
# Pivot the data from wide to long format
joy <- pivot_longer(my_data, 
                         cols = -c(`Country Name`),
                         names_to = "Year", 
                         values_to = "GDP")

# Show the first few rows
head(joy)
```
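
To illustrate the reshape on a small scale, here is a sketch with a made-up two-country table (the values are invented, not taken from the GDP file). It also shows the optional `names_transform` argument, which converts `Year` to integer during the pivot. The notebook itself keeps `Year` as character and converts later with `as.numeric()`.

```r
library(tidyr)

# Hypothetical wide table: one column per year (values are invented)
toy_wide <- data.frame(
  `Country Name` = c("Germany", "Japan"),
  `2011` = c(100, 200),
  `2012` = c(110, 210),
  check.names = FALSE
)

# One row per country-year; Year is converted to integer on the way
toy_long <- pivot_longer(
  toy_wide,
  cols = -`Country Name`,
  names_to = "Year",
  values_to = "GDP",
  names_transform = list(Year = as.integer)
)

toy_long
```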


Perform summary statistics, check for `NA`s, inspect the dimensions with `dim()`, etc.
```{r}
# Display structure of dataset
str(joy)

# Show the first few rows 
head(joy)

# Summary statistics of numerical attributes
summary(joy)

# Summary statistics of categorical attribute
table(joy$`Indicator Code`)

# Check for missing values
sum(is.na(joy))

# Check the dimension of data
dim(joy)
```

Filter the data to only show the 19 countries and the date range the project is focused on.

```{r}
# Define the list of 19 countries
countries_of_interest <- c("Australia", "Belgium", "Brazil", "Canada", "Denmark",
                           "Finland", "France", "Germany", "Iceland", "Italy",
                           "Japan", "Netherlands", "New Zealand", "Norway", 
                           "South Africa", "Spain", "Sweden", "United Kingdom", 
                           "United States")

# Filter the data for the 19 countries
gdp <- joy %>%
  filter(`Country Name` %in% countries_of_interest)

# Define the date range of focus
start_year <- 2011
end_year <- 2020

# Filter the data for the date range
gdp <- gdp %>%
  filter(as.numeric(Year) >= start_year & as.numeric(Year) <= end_year)

# Show the filtered data
gdp %>% 
  sample_n(10)
```

```{r}
# Display structure of dataset
str(gdp)

# Convert Year to numeric 
# gdp$Year <- as.numeric(gdp$Year)

# Show the first few rows 
head(gdp)

# Summary statistics of numerical attributes
summary(gdp)

# Summary statistics of categorical attribute
table(gdp$`Indicator Code`)

# Check for missing values
sum(is.na(gdp))

# Check the dimension of data
dim(gdp)
```

```{r}
# Group the filtered data by Year and calculate the mean GDP for each year
mean_gdp_by_year <- gdp %>%
  group_by(Year) %>%
  summarize(mean_GDP = mean(GDP, na.rm = TRUE))

# Show the mean GDP for each year
print(mean_gdp_by_year)
```
Visualization of the 10 countries with the highest GDP over time.

```{r}
# Filter to include only the top 10 countries with the highest GDP values
top_10_gdp <- gdp %>%
  group_by(`Country Name`) %>%
  summarize(total_gdp = sum(GDP, na.rm = TRUE)) %>%
  top_n(10, total_gdp) %>%
  left_join(gdp, by = "Country Name")

# Convert GDP values to millions or billions
top_10_gdp <- top_10_gdp %>%
  mutate(GDP_formatted = case_when(
    GDP >= 1e12 ~ paste0(round(GDP / 1e12, 1), "T"),  # Convert to trillions
    GDP >= 1e9 ~ paste0(round(GDP / 1e9, 1), "B"),  # Convert to billions
    GDP >= 1e6 ~ paste0(round(GDP / 1e6, 1), "M"),   # Convert to millions
    TRUE ~ as.character(GDP)                        # Keep unchanged if less than 1 million
  ))

# Create a line plot of GDP over time with formatted values for the top 10 countries
ggplot(top_10_gdp, aes(x = Year, y = GDP, group = `Country Name`)) +
  geom_line(aes(color = `Country Name`)) +
  labs(title = "GDP Trends Over Time (Top 10 Countries)",
       x = "Year",
       y = "GDP",
       color = "Country") +
  scale_y_continuous(labels = function(x) paste0(x, "")) +  # Ensure y-axis labels are character type
  theme_minimal()
```




## Exploration of the Suicide Rate Dataset
```{r}
# Load data set from github
suicide_df = read_csv("https://raw.githubusercontent.com/Alexburk93/Data_Wrangling_EDA/main/data/raw_data/death-rate-from-suicides-gho%20new.csv")
```
```{r}
suicide_df$`Age-standardized suicide rate - Sex: both sexes` = as.double(suicide_df$`Age-standardized suicide rate - Sex: both sexes`)
summary(suicide_df)

# Consider removing Lesotho from the analysis 
suicide_df %>% 
  filter(is.na(`Age-standardized suicide rate - Sex: both sexes`))
```

```{r}
# Remove the Region / Income classes and Lesotho
suicide_df_countries = suicide_df %>% 
  filter(Code != "0", Code != "LSO")
```

```{r}
avg_country = suicide_df_countries %>% 
  group_by(Entity) %>% 
  summarise(avg_per_country = mean(`Age-standardized suicide rate - Sex: both sexes`)) %>% 
  arrange(desc(avg_per_country))
``` 

```{r}
ggplot(avg_country, aes(x = avg_per_country)) +
  geom_histogram() +
  labs(x = "Mean Suicide rate", y = "Frequency", title = "Distribution of mean of suicide rate") +
  theme_minimal()
```

## Exploration of the Happiness Dataset

```{r}
# Use the raw file URL
url <- "https://github.com/Alexburk93/Data_Wrangling_EDA/raw/main/data/raw_data/WHR20_DataForTable2.1.xls"
response <- GET(url)
stop_for_status(response)  # Fail early if the download did not succeed
raw_bytes <- content(response, "raw")  # Avoid shadowing httr::content()
temp <- tempfile(fileext = ".xls")
writeBin(raw_bytes, temp)
happiness_df <- read_excel(temp)

# View the data
head(happiness_df)

```
```{r}
# create data frame with selected columns. Based on the description of the different variables.
happiness_df_filtered = happiness_df %>% 
  select(`Country name`, `year` , `Life Ladder`, `Social support`, `Healthy life expectancy at birth`, `Freedom to make life choices`, `Perceptions of corruption`)

summary(happiness_df_filtered)
```

```{r}
avg_happiness_per_country = happiness_df_filtered %>% 
  group_by(`Country name`) %>% 
  summarise(avg_happiness = mean(`Life Ladder`)) %>% 
  arrange(desc(avg_happiness))
```

```{r}
ggplot(avg_happiness_per_country, aes(x = avg_happiness)) +
  geom_histogram() +
  labs(x = "Mean Happiness Level", y = "Frequency", title = "Distribution of mean of happiness rate") +
  theme_minimal()
```

## Data for Final Presentation


Happiness Data for the 19 countries of interest from 2011 - 2019
```{r}
# Define the list of 19 countries
countries_of_interest <- c("Australia", "Belgium", "Brazil", "Canada", "Denmark",
                           "Finland", "France", "Germany", "Iceland", "Italy",
                           "Japan", "Netherlands", "New Zealand", "Norway", 
                           "South Africa", "Spain", "Sweden", "United Kingdom", 
                           "United States")

# Filter the data for the 19 countries
happiness_new<- happiness_df_filtered %>%
  filter(`Country name` %in% countries_of_interest)

# Define the date range of focus
start_year <- 2011
end_year <- 2020

# Filter the data for the date range
happiness_data <- happiness_new %>%
  filter(as.numeric(year) >= start_year & as.numeric(year) <= end_year)

# Show the filtered data
happiness_data %>% 
  sample_n(10)
```

```{r}
# Check if all countries of interest are present in the filtered data
missing_countries <- setdiff(countries_of_interest, unique(happiness_data$`Country name`))

# Print the missing countries, if any
if(length(missing_countries) > 0) {
  print("The following countries are missing from the filtered data:")
  print(missing_countries)
} else {
  print("All countries of interest are selected from the filtered data.")
}

```

```{r}
# Count the occurrences of each country in the filtered data
country_counts <- table(happiness_data$`Country name`)

# Print the country names and their counts
print("Country Name\t\tCount")
for (country in names(country_counts)) {
  cat(country, "\t\t\t", country_counts[country], "\n")
}

```

Suicide Data for the 19 countries of interest from 2011 - 2019
```{r}
# Change column name from `Entity` to `country name`
suicide_df <- rename(suicide_df, `Country name` = Entity)

# Define the list of 19 countries
countries_of_interest <- c("Australia", "Belgium", "Brazil", "Canada", "Denmark",
                           "Finland", "France", "Germany", "Iceland", "Italy",
                           "Japan", "Netherlands", "New Zealand", "Norway", 
                           "South Africa", "Spain", "Sweden", "United Kingdom", 
                           "United States")

# Filter the data for the 19 countries
new_suicide<- suicide_df %>%
  filter(`Country name` %in% countries_of_interest)

# Define the date range of focus
start_year <- 2011
end_year <- 2020

# Filter the data for the date range
suicide_data <- new_suicide %>%
  filter(as.numeric(Year) >= start_year & as.numeric(Year) <= end_year)

# Show the filtered data
suicide_data %>% 
  sample_n(10)
```

## Print all the data sets

```{r}
suicide_data %>% 
  sample_n(10)

happiness_data %>% 
  sample_n(10)

gdp %>% 
  sample_n(10)

bankruptcy_data %>% 
  sample_n(10)
```


## Align column names across the datasets and view sample data
```{r}
gdp
bankruptcy_data
```


```{r}
# Align the key column names across datasets (`Country name` and `Year`)
gdp_data <- rename(gdp, `Country name` = `Country Name`)
bankruptcy_data<- rename(bankruptcy_data, `Country name` = `Country`)
happiness_data = rename(happiness_data, `Year` = `year`)
suicide_data %>% 
  sample_n(10)

happiness_data %>% 
  sample_n(10)

gdp_data %>% 
  sample_n(10)

bankruptcy_data %>% 
  sample_n(10)
```
## Merging datasets

```{r}
gdp_data
happiness_data
bankruptcy_data
suicide_data
```

```{r}
# Convert "Year" column to character type in all datasets
suicide_data <- mutate(suicide_data, Year = as.character(Year))
happiness_data <- mutate(happiness_data, Year = as.character(Year))
gdp_data <- mutate(gdp_data, Year = as.character(Year))
bankruptcy_data <- mutate(bankruptcy_data, Year = as.character(Year))

# Merge the datasets
suicide_analysis <- suicide_data %>%
  left_join(happiness_data, by = c("Year", "Country name")) %>%
  left_join(gdp_data, by = c("Year", "Country name")) %>%
  left_join(bankruptcy_data, by = c("Year", "Country name"))

```
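
Because `bankruptcy_data` keeps a `Quarter` column, joining on year and country alone may repeat each suicide row once per quarter. The sketch below uses invented numbers and hypothetical `toy_*` names to show the effect, and one possible alternative: averaging to one value per country and year before joining. Whether a mean or a sum is the right annual summary depends on what `Value` measures, which is not settled here.

```r
library(dplyr)

# Hypothetical annual table (one row per country-year)
toy_suicide <- data.frame(
  `Country name` = c("Germany", "Germany"),
  Year = c("2011", "2012"),
  rate = c(10, 11),
  check.names = FALSE
)

# Hypothetical quarterly table (several rows per country-year)
toy_bankruptcy <- data.frame(
  `Country name` = rep("Germany", 4),
  Year = c("2011", "2011", "2012", "2012"),
  Quarter = c("Q1", "Q2", "Q1", "Q2"),
  Value = c(100, 120, 90, 110),
  check.names = FALSE
)

# Direct join: every annual row is repeated for each matching quarter
toy_direct <- left_join(toy_suicide, toy_bankruptcy,
                        by = c("Year", "Country name"))

# Alternative: collapse to one row per country-year first, then join
toy_annual <- toy_bankruptcy %>%
  group_by(`Country name`, Year) %>%
  summarise(Value = mean(Value, na.rm = TRUE), .groups = "drop")

toy_merged <- left_join(toy_suicide, toy_annual,
                        by = c("Year", "Country name"))

nrow(toy_direct)
nrow(toy_merged)
```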
```{r}
# View sample data from the merged dataset
suicide_analysis 
```

Save the dataset as a CSV file to run the analysis for the final project.
```{r}

# # Drop specified columns
# suicide_analysis <- suicide_analysis %>%
#   select(-c("Code", "COU", "PowerCode", "PowerCode Code"))
# 
# # Save the merged dataset as a CSV file
# # Define the path to the folder on your desktop
# desktop_path <- "/home/alex/Uni/Master_US/2_Semester/Class_Data_Wrangeling_EDA/Data_Wrangling_EDA/data/"
# 
# # Create the folder if it doesn't exist
# dir.create(desktop_path, showWarnings = FALSE)
# 
# # Save the merged dataset as a CSV file in the specified folder
# write.csv(suicide_analysis, file.path(desktop_path, "suicide_analysis_2.csv"), row.names = FALSE)
```

```{r}
library(dplyr)
library(readr)
library(tidyverse)
library(ggplot2)
library(highcharter)
library(magrittr)
```
# EDA
## Data loading
```{r}
data <- read_csv("https://raw.githubusercontent.com/Alexburk93/Data_Wrangling_EDA/main/data/suicide_analysis_2.csv")
data %>% sample_n(19)
```
## Renaming variables and drop unwanted columns
```{r}
# renaming the columns 
data <- data %>%
  rename(`Country_name` = `Country name`,
         `Suicide_Rate` = `Age-standardized suicide rate - Sex: both sexes`,
         `Life_ladder` = `Life Ladder`,
         `Social_support` = `Social support`,
         `Life_expectancy` = `Healthy life expectancy at birth`,
         `Freedom_choices` = `Freedom to make life choices`,
         `Corruption` = `Perceptions of corruption`)
```

```{r}
data
```


```{r}
# Drop bankruptcy metadata columns that are not needed for the analysis
data <- select(data, -Variable, -VAR, -MEA, -`Unit Code`)  
data
```

## Data exploration
```{r}
head(data)
```

```{r}
# Convert column names to syntactically valid R names
names(data) <- make.names(names(data))
```

```{r}
# dimensions of the dataframe
nrow(data)
ncol(data)
dim(data)
```

```{r}
# check the structure of the object
str(data)
```
```{r}
# Look at the first rows of columns 2, 4 to 6, 12 and 15
head(data[ , c(2, 4:6, 12, 15)])
```

```{r}
# Look at the last rows of columns 1, 3 and 9
tail(data[ , c(1, 3, 9)])
```

```{r}
table(data$Year)
```


```{r}
data %>% 
  select(Country_name) %>% 
  unique() %>% 
  nrow()
```

```{r}
unique(data$Country_name)
```

```{r}
unique(data$Year)
```

## Interactive maps
```{r}
# Set highcharter options for tooltip decimals
options(highcharter.tooltip.valueDecimals = 2)

# Create highcharter map visualization
hc <- highchart() %>%
  hc_add_series_map(
    worldgeojson, data, value = "GDP", 
    joinBy = c('name', 'Country_name'),
    name = "GDP (current US$)"
  )  %>% 
  hc_colorAxis(stops = color_stops()) %>% 
  hc_title(text = "World Map") %>% 
  hc_subtitle(text = "GDP in current US$")

hc
```

```{r}
# Set highcharter options for tooltip decimals
options(highcharter.tooltip.valueDecimals = 2)

# Create map visualizations for each variable
hc_life_expectancy <- highchart() %>%
  hc_add_series_map(
    worldgeojson, data, 
    value = "Life_expectancy", 
    joinBy = c('name', 'Country_name'),
    name = "Life Expectancy"
  ) %>%
  hc_colorAxis(stops = color_stops()) %>%
  hc_title(text = "World Map") %>%
  hc_subtitle(text = "Life Expectancy")

hc_suicide_rates <- highchart() %>%
  hc_add_series_map(
    worldgeojson, data, 
    value = "Suicide_Rate", 
    joinBy = c('name', 'Country_name'),
    name = "Suicide Rates"
  ) %>%
  hc_colorAxis(stops = color_stops()) %>%
  hc_title(text = "World Map") %>%
  hc_subtitle(text = "Suicide Rate")

hc_corruption <- highchart() %>%
  hc_add_series_map(
    worldgeojson, data, 
    value = "Corruption", 
    joinBy = c('name', 'Country_name'),
    name = "Corruption"
  ) %>%
  hc_colorAxis(stops = color_stops()) %>%
  hc_title(text = "World Map") %>%
  hc_subtitle(text = "Corruption")

# Display the map visualizations (tagList renders several widgets in one chunk)
htmltools::tagList(hc_life_expectancy, hc_suicide_rates, hc_corruption)

```
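
A choropleth shows one value per country, while the merged `data` holds several rows per country (one per year and, possibly, per quarter). How the map picks among repeated keys is not specified in this notebook, so one option is to make the choice explicit by collapsing to a single row per country first. A minimal sketch with invented values and hypothetical `toy_*` names:

```r
library(dplyr)

# Hypothetical panel: several rows per country (values are invented)
toy_panel <- data.frame(
  Country_name = c("Germany", "Germany", "Japan", "Japan"),
  Year = c(2011, 2012, 2011, 2012),
  GDP = c(100, 110, 200, NA)
)

# One row per country: the period average, ignoring missing years
toy_map_data <- toy_panel %>%
  group_by(Country_name) %>%
  summarise(GDP = mean(GDP, na.rm = TRUE), .groups = "drop")

toy_map_data
```

A table shaped like `toy_map_data` could then be passed to `hc_add_series_map()` in place of the full panel; a single year could be selected with `filter()` instead if a snapshot is preferred over an average.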
# Analysis 
## AVG GDP over years
### Calculation 
```{r}
avg_gdp_per_year <- data %>% 
  group_by (`Year`) %>% 
  summarise(avg_gpd = mean(`GDP`, na.rm = TRUE))

avg_gdp_per_year
```
### Plot 
```{r}
ggplot(avg_gdp_per_year, aes(x = Year, y = avg_gpd)) +
  geom_line(color = "blue") +  
  labs(title = "Average GDP Over Time (19 Countries)",
       x = "Year",
       y = "GDP") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(avg_gdp_per_year$Year), max(avg_gdp_per_year$Year), by = 1))
```
## AVG Happiness over years
### Calculation 
```{r}
avg_happiness_per_year <- data %>% 
  group_by (`Year`) %>% 
  summarise(avg_happinnes = mean(`Life_ladder`, na.rm = T))

avg_happiness_per_year
```
### Plot 

```{r}
ggplot(avg_happiness_per_year, aes(x = Year, y = avg_happinnes)) +
  geom_line(color = "blue") +  
  labs(title = "Average Happiness Over Time (19 Countries)",
       x = "Year",
       y = "Happiness") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(avg_happiness_per_year$Year), max(avg_happiness_per_year$Year), by = 1))
```

## AVG Suicide Rates over years
### Calculation 
```{r}
avg_Suicide_Rate_per_year <- data %>% 
  group_by (`Year`) %>% 
  summarise(avg_Suicide_Rate = mean(`Suicide_Rate`, na.rm = T))

avg_Suicide_Rate_per_year
```
### Plot 

```{r}
ggplot(avg_Suicide_Rate_per_year, aes(x = Year, y = avg_Suicide_Rate)) +
  geom_line(color = "blue") +  
  labs(title = "Average Suicide Rate Over Time (19 Countries)",
       x = "Year",
       y = "Suicide Rate") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(avg_Suicide_Rate_per_year$Year), max(avg_Suicide_Rate_per_year$Year), by = 1))
```
## AVG Bankruptcies over years
### Calculation 
```{r}
avg_Bankruptcies_per_year <- data %>% 
  group_by (`Year`) %>% 
  summarise(avg_Bankruptcies = mean(`Value`, na.rm = T))

avg_Bankruptcies_per_year
```
### Plot 

```{r}
ggplot(avg_Bankruptcies_per_year, aes(x = Year, y = avg_Bankruptcies)) +
  geom_line(color = "blue") +  
  labs(title = "Average Bankruptcies Over Time (19 Countries)",
       x = "Year",
       y = "Bankruptcies") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(avg_Bankruptcies_per_year$Year), max(avg_Bankruptcies_per_year$Year), by = 1))
```

## Plot Average GDP and Average Suicide Rate over years
```{r}
# Finding the ratio for scaling the second axis
ratio <- max(avg_gdp_per_year$avg_gpd) / max(avg_Suicide_Rate_per_year$avg_Suicide_Rate)

# Creating the base plot
ggplot() +
  # Adding the line plot for Average GDP
  geom_line(data = avg_gdp_per_year, aes(x = Year, y = avg_gpd), size = 0.5) +
  # Adding the line plot for Average Suicide Rate (scaled to the GDP axis)
  geom_line(data = avg_Suicide_Rate_per_year, aes(x = Year, y = avg_Suicide_Rate * ratio), color = "red", size = 0.5) +
  # Enhancing the plot
  labs(title = "Average GDP and Suicide Rate Over Time",
       x = "Year",
       y = "Average GDP") +
  scale_y_continuous(sec.axis = sec_axis(~ . / ratio, name = "Average Suicide Rate")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(c(avg_gdp_per_year$Year, avg_Suicide_Rate_per_year$Year)), 
                                  max(c(avg_gdp_per_year$Year, avg_Suicide_Rate_per_year$Year)), by = 1))

```
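
The dual-axis plots in this section all use the same scaling trick. The arithmetic can be sketched with invented numbers (the `*_toy` vectors are hypothetical): the second series is multiplied by `ratio` so that its maximum lines up with the maximum of the first series, and the secondary axis divides by `ratio` to print the original units again.

```r
# Hypothetical yearly averages (values are invented)
gdp_toy  <- c(40000, 42000, 45000)
rate_toy <- c(12, 11.5, 11)

# Same formula as in the plots: ratio of the two maxima
ratio <- max(gdp_toy) / max(rate_toy)

# Values actually drawn on the primary (GDP) axis
rate_scaled <- rate_toy * ratio

# What the secondary axis labels undo: back to the original rate units
rate_back <- rate_scaled / ratio

data.frame(rate_toy, rate_scaled, rate_back)
```

Since only the maxima are aligned, the vertical distance between the two lines and any crossing points depend on the scaling and carry no meaning of their own; only the direction of each trend should be read from such a plot.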

## Plot Average Happiness and Average Suicide Rate over years
```{r}
# Finding the ratio for scaling the second axis
ratio <- max(avg_happiness_per_year$avg_happinnes) / max(avg_Suicide_Rate_per_year$avg_Suicide_Rate)

# Creating the base plot
ggplot() +
  # Adding the line plot for Average Happiness
  geom_line(data = avg_happiness_per_year, aes(x = Year, y = avg_happinnes), size = 0.5) +
  # Adding the line plot for Average Suicide Rate (scaled to the happiness axis)
  geom_line(data = avg_Suicide_Rate_per_year, aes(x = Year, y = avg_Suicide_Rate * ratio), color = "red", size = 0.5) +
  # Enhancing the plot
  labs(title = "Average Happiness and Suicide Rate Over Time",
       x = "Year",
       y = "Average Happiness") +
  scale_y_continuous(sec.axis = sec_axis(~ . / ratio, name = "Average Suicide Rate")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(c(avg_happiness_per_year$Year, avg_Suicide_Rate_per_year$Year)), 
                                  max(c(avg_happiness_per_year$Year, avg_Suicide_Rate_per_year$Year)), by = 1))

```
```{r}
# Finding the ratio for scaling the second axis
ratio <- max(avg_happiness_per_year$avg_happinnes) / max(avg_Suicide_Rate_per_year$avg_Suicide_Rate)

# Creating the base plot
ggplot() +
  # Adding the line plot for Average Happiness
  geom_line(data = avg_happiness_per_year, aes(x = Year, y = avg_happinnes), size = 0.5) +
  # Adding the line plot for Suicide Rate adjusted by the ratio
  geom_line(data = avg_Suicide_Rate_per_year, aes(x = Year, y = avg_Suicide_Rate * ratio), color = "red", size = 0.5) +
  # Setting up the titles and labels
  labs(title = "Average Happiness and Suicide Rate Over Time",
       x = "Year",
       y = "Average Happiness",
       subtitle = "Suicide rates are scaled to compare against happiness") +
  # Primary axis for Happiness, secondary axis for Suicide Rate (inversed scaling)
  scale_y_continuous(name = "Average Happiness",
                     sec.axis = sec_axis(~ . / ratio, name = "Average Suicide Rate")) +
  # Minimalist theme with angled x-axis texts
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  # Set x-axis breaks
  scale_x_continuous(breaks = seq(min(c(avg_happiness_per_year$Year, avg_Suicide_Rate_per_year$Year)), 
                                  max(c(avg_happiness_per_year$Year, avg_Suicide_Rate_per_year$Year)), by = 1))

```

## Plot Average Bankruptcies and Average Suicide Rate over years
```{r}
# Finding the ratio for scaling the second axis
ratio <- max(avg_Bankruptcies_per_year$avg_Bankruptcies) / max(avg_Suicide_Rate_per_year$avg_Suicide_Rate)

# Creating the base plot
ggplot() +
  # Adding the line plot for Average Bankruptcies
  geom_line(data = avg_Bankruptcies_per_year, aes(x = Year, y = avg_Bankruptcies), size = 0.5) +
  # Adding the line plot for Average Suicide Rate (scaled to the bankruptcies axis)
  geom_line(data = avg_Suicide_Rate_per_year, aes(x = Year, y = avg_Suicide_Rate * ratio), color = "red", size = 0.5) +
  # Enhancing the plot
  labs(title = "Average Bankruptcies and Suicide Rate Over Time",
       x = "Year",
       y = "Average Bankruptcies") +
  scale_y_continuous(sec.axis = sec_axis(~ . / ratio, name = "Average Suicide Rate")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(c(avg_Bankruptcies_per_year$Year, avg_Suicide_Rate_per_year$Year)), 
                                  max(c(avg_Bankruptcies_per_year$Year, avg_Suicide_Rate_per_year$Year)), by = 1))

```

## Comparisons

### Two happiest and two unhappiest countries vs average suicide rate
```{r}
avg_Suicide_Rate_per_country = data %>%
  group_by(Country_name) %>%
  summarise(avg_suicide_rate = mean(Suicide_Rate, na.rm = TRUE)) %>%
  arrange(avg_suicide_rate) %>% 
  mutate(Row_Number = row_number())

avg_Suicide_Rate_per_country

avg_happiness_per_country <- data %>%
  group_by(Country_name) %>%
  summarise(avg_happiness = mean(Life_ladder, na.rm = TRUE)) %>%
  arrange(desc(avg_happiness))

least_happy =  tail(avg_happiness_per_country, 2)
most_happy = head(avg_happiness_per_country, 2)


avg_Suicide_Rate_per_country %>% 
  filter(Country_name %in% least_happy$Country_name)

# Interpretation: Japan and South Africa are the two least happy countries in the sample, and both also have a high suicide rate

avg_Suicide_Rate_per_country %>% 
  filter(Country_name %in% most_happy$Country_name)

# Interpretation: Finland is the second-happiest country, yet it ranks 16th of 19 when countries are sorted from lowest to highest suicide rate

```

### Two wealthiest and two poorest countries vs average suicide rate
```{r}
avg_gdp_per_country <- data %>% 
  group_by (`Country_name`) %>% 
  summarise(avg_gpd = mean(`GDP`)) %>% 
  arrange(desc(avg_gpd))

least_gdp =  tail(avg_gdp_per_country, 2)
most_gdp = head(avg_gdp_per_country, 2)

avg_Suicide_Rate_per_country %>% 
  filter(Country_name %in% least_gdp$Country_name)

# Interpretation: Brazil has a low GDP per capita but still ranks low for suicide. South Africa has a low GDP per capita and also ranks poorly for suicide

avg_Suicide_Rate_per_country %>% 
  filter(Country_name %in% most_gdp$Country_name)

# Interpretation: The two wealthiest countries (by average GDP per capita) sit in the middle of the suicide ranking
```


### Two countries with the most and two with the fewest bankruptcies vs average suicide rate
```{r}
# Per-country averages get their own name so the per-year table is not overwritten
avg_bankruptcies_per_country <- data %>% 
  group_by(`Country_name`) %>% 
  summarise(avg_bankruptcies = mean(`Value`, na.rm = T)) %>% 
  arrange(desc(avg_bankruptcies))

least_bank = tail(avg_bankruptcies_per_country, 2)
most_bank = head(avg_bankruptcies_per_country, 2)

avg_Suicide_Rate_per_country %>% 
  filter(Country_name %in% least_bank$Country_name)

# Interpretation: No clear link between bankruptcies and suicide rates is visible for the countries with the fewest bankruptcies

avg_Suicide_Rate_per_country %>% 
  filter(Country_name %in% most_bank$Country_name)

# Interpretation: No clear link is visible for the countries with the most bankruptcies either; four countries are too few to rule an influence in or out
```
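
Comparing only the top two and bottom two countries uses a small part of the sample. As a complementary sketch, a rank correlation across all countries uses every observation. The table below is invented (`toy_country` and its values are hypothetical); in the notebook, the per-country averages of bankruptcies and suicide rates would first have to be joined by `Country_name`.

```r
# Hypothetical per-country averages (values are invented)
toy_country <- data.frame(
  Country_name = c("A", "B", "C", "D", "E"),
  avg_bankruptcies = c(50, 80, 120, 200, 90),
  avg_suicide_rate = c(9, 14, 8, 12, 11)
)

# Spearman rank correlation: robust to outliers and to differing scales
rho <- cor(
  toy_country$avg_bankruptcies,
  toy_country$avg_suicide_rate,
  method = "spearman",
  use = "complete.obs"
)

rho
```

A coefficient near zero would be consistent with the interpretation above, but with 19 countries the estimate is imprecise, and a correlation in either direction would not by itself show an influence.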
## In depth analysis Germany
### Data preparation for Germany
```{r}
# Prepare data for only Germany
germany_data = data %>% 
  filter(Country_name == "Germany")
```

### Plot GDP Germany
```{r}
# Plot Germany GDP over Years
avg_gdp_year_germany = germany_data %>% 
  group_by(Year) %>% 
  summarise(avg_gdp = mean(GDP)) 


ggplot(avg_gdp_year_germany, aes(x = Year, y = avg_gdp)) +
  geom_line(color = "blue") +  
  labs(title = "Average GDP Over Time - Germany",
       x = "Year",
       y = "GDP") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(avg_gdp_year_germany$Year), max(avg_gdp_year_germany$Year), by = 1))

```

### Plot Suicide Rate Germany
```{r}
avg_suicide_year_germany = germany_data %>% 
  group_by(Year) %>% 
  summarise(avg_suicide = mean(Suicide_Rate)) 


ggplot(avg_suicide_year_germany, aes(x = Year, y = avg_suicide)) +
  geom_line(color = "blue") +  
  labs(title = "Average Suicide Rate Over Time - Germany",
       x = "Year",
       y = "Suicide Rate") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(avg_suicide_year_germany$Year), max(avg_suicide_year_germany$Year), by = 1))
```
### Plot Bankruptcies Rate Germany
```{r}
avg_bank_year_germany = germany_data %>% 
  group_by(Year) %>% 
  summarise(avg_bank = mean(Value)) 


ggplot(avg_bank_year_germany, aes(x = Year, y = avg_bank)) +
  geom_line(color = "blue") +  
  labs(title = "Average bankruptcies Over Time - Germany",
       x = "Year",
       y = "Bankruptcies") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(avg_bank_year_germany$Year), max(avg_bank_year_germany$Year), by = 1))
```
### Plot Happiness Rate Germany
```{r}
avg_happiness_year_germany = germany_data %>% 
  group_by(Year) %>% 
  summarise(avg_happy = mean(Life_ladder)) 


ggplot(avg_happiness_year_germany, aes(x = Year, y = avg_happy)) +
  geom_line(color = "blue") +  
  labs(title = "Average Happiness Over Time - Germany",
       x = "Year",
       y = "Happiness") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(avg_happiness_year_germany$Year), max(avg_happiness_year_germany$Year), by = 1))
```

### Plot Average GDP and Average Suicide Rate over years - Germany
```{r}
avg_suicide_year_germany$avg_suicide
# Finding the ratio for scaling the second axis
ratio <- max(avg_gdp_year_germany$avg_gdp) / max(avg_suicide_year_germany$avg_suicide)

# Creating the base plot
ggplot() +
  # Adding the line plot for Average GDP
  geom_line(data = avg_gdp_year_germany, aes(x = Year, y = avg_gdp), size = 0.5) +
  # Adding the line plot for Average Suicide Rate (scaled to the GDP axis)
  geom_line(data = avg_suicide_year_germany, aes(x = Year, y = avg_suicide * ratio), color = "red", size = 0.5) +
  # Enhancing the plot
  labs(title = "Average GDP and Suicide Rate Over Time - Germany",
       x = "Year",
       y = "Average GDP") +
  scale_y_continuous(sec.axis = sec_axis(~ . / ratio, name = "Average Suicide Rate")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(c(avg_gdp_year_germany$Year, avg_suicide_year_germany$Year)), 
                                  max(c(avg_gdp_year_germany$Year, avg_suicide_year_germany$Year)), by = 1))
```

### Plot Average Happiness and Average Suicide Rate over years - Germany
```{r}
# Finding the ratio for scaling the second axis
ratio <- max(avg_happiness_year_germany$avg_happy) / max(avg_suicide_year_germany$avg_suicide)

# Creating the base plot
ggplot() +
  # Adding the line plot for Average Happiness
  geom_line(data = avg_happiness_year_germany, aes(x = Year, y = avg_happy), size = 0.5) +
  # Adding the line plot for Average Suicide Rate (scaled to the happiness axis)
  geom_line(data = avg_suicide_year_germany, aes(x = Year, y = avg_suicide * ratio), color = "red", size = 0.5) +
  # Enhancing the plot
  labs(title = "Average Happiness and Suicide Rate Over Time - Germany",
       x = "Year",
       y = "Average Happiness") +
  scale_y_continuous(sec.axis = sec_axis(~ . / ratio, name = "Average Suicide Rate")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(c(avg_happiness_year_germany$Year, avg_suicide_year_germany$Year)), 
                                  max(c(avg_happiness_year_germany$Year, avg_suicide_year_germany$Year)), by = 1))

```
### Plot Average Bankruptcies and Average Suicide Rate over years - Germany
```{r}
# Finding the ratio for scaling the second axis
ratio <- max(avg_bank_year_germany$avg_bank) / max(avg_suicide_year_germany$avg_suicide)

# Creating the base plot
ggplot() +
  # Adding the line plot for Average Bankruptcies
  geom_line(data = avg_bank_year_germany, aes(x = Year, y = avg_bank), size = 0.5) +
  # Adding the line plot for Average Suicide Rate (scaled to the bankruptcies axis)
  geom_line(data = avg_suicide_year_germany, aes(x = Year, y = avg_suicide * ratio), color = "red", size = 0.5) +
  # Enhancing the plot
  labs(title = "Average Bankruptcies and Suicide Rate Over Time - Germany",
       x = "Year",
       y = "Average Bankruptcies") +
  scale_y_continuous(sec.axis = sec_axis(~ . / ratio, name = "Average Suicide Rate")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(c(avg_bank_year_germany$Year, avg_suicide_year_germany$Year)), 
                                  max(c(avg_bank_year_germany$Year, avg_suicide_year_germany$Year)), by = 1))

```


## In depth analysis South Africa
### Data preparation for South Africa
```{r}
# Prepare data for only South Africa
SA_data = data %>% 
  filter(Country_name == "South Africa")
```

### Plot GDP SA
```{r}
# Plot SA GDP over Years
avg_gdp_year_SA = SA_data %>% 
  group_by(Year) %>% 
  summarise(avg_gdp = mean(GDP)) 



ggplot(avg_gdp_year_SA, aes(x = Year, y = avg_gdp)) +
  geom_line(color = "blue") +  
  labs(title = "Average GDP Over Time - SA",
       x = "Year",
       y = "GDP") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(avg_gdp_year_SA$Year), max(avg_gdp_year_SA$Year), by = 1))

```
### Plot Suicide Rate SA
```{r}
avg_suicide_year_SA = SA_data %>% 
  group_by(Year) %>% 
  summarise(avg_suicide = mean(Suicide_Rate)) 


ggplot(avg_suicide_year_SA, aes(x = Year, y = avg_suicide)) +
  geom_line(color = "blue") +  
  labs(title = "Average Suicide Rate Over Time - SA",
       x = "Year",
       y = "Suicide Rate") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(avg_suicide_year_SA$Year), max(avg_suicide_year_SA$Year), by = 1))
```

### Plot Happiness Rate SA
```{r}
avg_happiness_year_SA = SA_data %>% 
  group_by(Year) %>% 
  summarise(avg_happy = mean(Life_ladder)) 


ggplot(avg_happiness_year_SA, aes(x = Year, y = avg_happy)) +
  geom_line(color = "blue") +  
  labs(title = "Average Happiness Over Time - SA",
       x = "Year",
       y = "Happiness") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(avg_happiness_year_SA$Year), max(avg_happiness_year_SA$Year), by = 1))
```
### Plot Bankruptcies Rate SA
```{r}
avg_bank_year_sa = SA_data %>% 
  group_by(Year) %>% 
  summarise(avg_bank = mean(Value)) 


ggplot(avg_bank_year_sa, aes(x = Year, y = avg_bank)) +
  geom_line(color = "blue") +  
  labs(title = "Average bankruptcies Over Time - SA",
       x = "Year",
       y = "Bankruptcies") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(avg_bank_year_sa$Year), max(avg_bank_year_sa$Year), by = 1))
```

### Plot Average GDP and Average Suicide Rate over years - SA
```{r}
# Finding the ratio for scaling the second axis
ratio <- max(avg_gdp_year_SA$avg_gdp) / max(avg_suicide_year_SA$avg_suicide)

# Creating the base plot
ggplot() +
  # Adding the line plot for Average GDP
  geom_line(data = avg_gdp_year_SA, aes(x = Year, y = avg_gdp), size = 0.5) +
  # Adding the line plot for Average Suicide Rate (scaled to the GDP axis)
  geom_line(data = avg_suicide_year_SA, aes(x = Year, y = avg_suicide * ratio), color = "red", size = 0.5) +
  # Enhancing the plot
  labs(title = "Average GDP and Suicide Rate Over Time - SA",
       x = "Year",
       y = "Average GDP") +
  scale_y_continuous(sec.axis = sec_axis(~ . / ratio, name = "Average Suicide Rate")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(c(avg_gdp_year_SA$Year, avg_suicide_year_SA$Year)), 
                                  max(c(avg_gdp_year_SA$Year, avg_suicide_year_SA$Year)), by = 1))
```

### Plot Average Bankruptcies and Average Suicide Rate over years - SA
```{r}
# Finding the ratio for scaling the second axis
ratio <- max(avg_bank_year_sa$avg_bank) / max(avg_suicide_year_SA$avg_suicide)

# Creating the base plot
ggplot() +
  # Adding the line plot for average bankruptcies
  geom_line(data = avg_bank_year_sa, aes(x = Year, y = avg_bank), size = 0.5) +
  # Adding the line plot for average suicide rate, rescaled to the bankruptcies axis
  geom_line(data = avg_suicide_year_SA, aes(x = Year, y = avg_suicide * ratio), color = "red", size = 0.5) +
  # Enhancing the plot
  labs(title = "Average Bankruptcies and Suicide Rate Over Time - SA",
       x = "Year",
       y = "Average Bankruptcies") +
  scale_y_continuous(sec.axis = sec_axis(~ . / ratio, name = "Average Suicide Rate")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(c(avg_bank_year_sa$Year, avg_suicide_year_SA$Year)), 
                                  max(c(avg_bank_year_sa$Year, avg_suicide_year_SA$Year)), by = 1))
```

### Plot Average Happiness and Average Suicide Rate over years - SA
```{r}
# Finding the ratio for scaling the second axis
ratio <- max(avg_happiness_year_SA$avg_happy) / max(avg_suicide_year_SA$avg_suicide)

# Creating the base plot
ggplot() +
  # Adding the line plot for average happiness
  geom_line(data = avg_happiness_year_SA, aes(x = Year, y = avg_happy), size = 0.5) +
  # Adding the line plot for average suicide rate, rescaled to the happiness axis
  geom_line(data = avg_suicide_year_SA, aes(x = Year, y = avg_suicide * ratio), color = "red", size = 0.5) +
  # Enhancing the plot
  labs(title = "Average Happiness and Suicide Rate Over Time - SA",
       x = "Year",
       y = "Average Happiness") +
  scale_y_continuous(sec.axis = sec_axis(~ . / ratio, name = "Average Suicide Rate")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(c(avg_happiness_year_SA$Year, avg_suicide_year_SA$Year)), 
                                  max(c(avg_happiness_year_SA$Year, avg_suicide_year_SA$Year)), by = 1))
```

# Conclusion and Summary

This project provided a fascinating opportunity to delve into an unfamiliar topic and to apply theoretical methods learned in class to a practical, real-world problem. We explored four datasets covering the years 2011 to 2019: Suicide Rate, GDP per Capita, Bankruptcies, and Happiness Index. We first analyzed each dataset independently to understand its composition and carried out a preliminary exploratory data analysis, which included calculating and plotting the average annual values for each dataset across the globe.

Following this foundational analysis, we progressed to the data wrangling phase and merged the four datasets. The resulting combined dataset included data from 19 countries, providing a basis for further analysis. In the analytical phase, we looked for correlations among the variables across these 19 countries, but we did not find any significant ones.

To gain deeper insights, we narrowed our focus to two specific countries: Germany and South Africa. For each country, we explored the data in detail, plotting trends over the years and searching for correlations between the variables within each national context. This analysis did not reveal any significant correlations either.
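
A per-country correlation check of the kind described above might look like the following sketch. It uses made-up values in a hypothetical `toy_country` data frame (only the column names `Suicide_Rate` and `Life_ladder` come from our data), so it illustrates the idea rather than reproducing the code used in the analysis.

```{r}
# Toy data: made-up values for illustration only, not taken from the project datasets
toy_country <- data.frame(
  Year         = 2011:2019,
  Suicide_Rate = c(12.1, 11.8, 12.4, 12.0, 11.5, 11.9, 11.2, 11.4, 11.0),
  Life_ladder  = c(6.6, 6.7, 6.5, 6.8, 6.9, 6.7, 7.0, 6.9, 7.1)
)

# Pearson correlation with a significance test (H0: the correlation is zero)
result <- cor.test(toy_country$Suicide_Rate, toy_country$Life_ladder,
                   method = "pearson")
result$estimate  # correlation coefficient
result$p.value   # p-value of the test

# Smallest |r| that is significant at the 5% level for this sample size
df     <- nrow(toy_country) - 2
t_crit <- qt(0.975, df)
r_crit <- t_crit / sqrt(t_crit^2 + df)
r_crit
```

With only nine yearly observations per country, a Pearson correlation has to be fairly strong in absolute value before it reaches significance at the 5% level, which is worth keeping in mind when interpreting a null result.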

# Possible Improvements

Given the absence of correlations in our initial analyses, we see a need to widen the dataset. Options for expanding it include incorporating additional variables such as alcohol and drug usage, unemployment rates, and sunshine hours in the countries studied. Additionally, moving from annual to monthly data might reveal trends and correlations that were not visible in the yearly data.

Furthermore, to enhance the quality and depth of our analysis, focusing on more granular details could provide significant insights, particularly by examining various socio-economic and political factors that influence country-specific behaviors. Incorporating additional variables such as political stability, social conflicts, and specific cultural constructs could enrich our understanding of the correlations or lack thereof in the data. These factors often have profound impacts on economic conditions, happiness indices, and social issues like suicide rates and bankruptcy, providing a more nuanced and comprehensive framework for analysis.

If given the opportunity to revisit this project from the beginning, we would integrate these broader socio-political variables from the start, allowing for a more thorough initial data collection phase. This approach would enable us to capture a wider spectrum of influences, potentially revealing hidden patterns and correlations that were not evident in our previous analysis. Moreover, employing advanced statistical methods or machine learning techniques could further aid in identifying complex interactions between variables.

To continue this project, our next steps would be to expand the dataset with these additional socio-political factors and to apply more sophisticated analytical techniques. This could involve time-series analysis for trend detection or cluster analysis to identify similar behavioral patterns across different countries. By doing so, we aim to build a richer analytical model that more accurately reflects the intricate realities influencing these critical societal indicators.
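
As a rough illustration of the cluster analysis idea, the sketch below groups countries by their average indicator values using hierarchical clustering from base R. The `toy_avgs` data frame, its country labels, and all of its values are made up for this example. The choice of two clusters and complete linkage is likewise an assumption, not a result from our data.

```{r}
# Toy country-level averages: made-up values for illustration only
toy_avgs <- data.frame(
  Country      = c("A", "B", "C", "D", "E", "F"),
  Suicide_Rate = c(10.2, 11.0, 9.8, 18.5, 19.1, 17.9),
  Life_ladder  = c(7.1, 6.9, 7.3, 5.2, 5.0, 5.4)
)

# Standardise so that variables on different scales contribute equally
features <- scale(toy_avgs[, c("Suicide_Rate", "Life_ladder")])
rownames(features) <- toy_avgs$Country

# Hierarchical clustering on Euclidean distances, cut into two groups
tree   <- hclust(dist(features), method = "complete")
groups <- cutree(tree, k = 2)
groups
```

Standardising first matters here because suicide rates and happiness scores are measured on different scales; without it, the variable with the larger numeric range would dominate the distances.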